Monitor actual resource usage
This page shows you how to use Kueue labels assigned to pods to monitor resource usage.
The intended audience for this page are batch administrators.
Before you begin
Make sure the following conditions are met:
- A Kubernetes cluster is running.
- The kubectl command-line tool has communication with your cluster.
- Kueue is installed.
- The cluster has at least one local queue and cluster queue. Presented commands assume your local queue is named “user-queue” and cluster queue is named “cluster-queue”, if this is not the case adjust them appropriately.
Note
The queries in this page requireAssignQueueLabelsForPods Feature Gate, which is enabled by default.
If it is not enabled, see Installation for details how to enable it.
kubectl top for command line resource usage debugging
Warning
As mentioned on Metrics Server repository site, Metrics Server and provided command line tool -kubectl top - is not meant as a monitoring solution. The tool is convenient
for quick troubleshooting but is not a substitute for full-scale monitoring. See the Production resource monitoring section for a more robust setup.
- To use
kubectl topyou need to install metrics-server for you cluster. Follow the Metrics Server Installation - Schedule a couple of jobs that have some actual cpu usage to your local queue:
for i in {1..3}; do
kubectl create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
generateName: sample-job-
namespace: default
labels:
kueue.x-k8s.io/queue-name: "user-queue"
spec:
parallelism: 1
completions: 1
template:
spec:
containers:
- name: dummy-job
image: registry.k8s.io/e2e-test-images/agnhost:2.53
command: [ "/bin/sh" ]
args: [ "-c", "end_time=$(($(date +%s) + 60)); while [ $(date +%s) -lt $end_time ]; do :; done"]
resources:
requests:
cpu: "1"
restartPolicy: Never
EOF
done
- Monitor the usage of cpu and memory in nearly real time for this local queue with
kubectl top:
watch -n 15 'kubectl top pod --sum -l kueue.x-k8s.io/local-queue-name=user-queue'
You will see a list of pods currently running on the local queue named “user-queue” with their current cpu and memory measurements and the sum of the usage at the bottom. The list will be automatically refreshed every 15 seconds. The outcome should look like this:
Every 15.0s: kubectl top pod --sum -l kueue.x-k8s.io/local-queue-name=user-queue
NAME CPU(cores) MEMORY(bytes)
sample-job-jd2vm-9hpnr 661m 3Mi
sample-job-p8mxr-cld27 684m 2Mi
sample-job-wccrt-h4684 654m 0Mi
________ ________
1998m 6Mi
- Similarly, you can monitor the usage of the resources by jobs admitted to a cluster queue:
watch -n 15 'kubectl top pod --sum -l kueue.x-k8s.io/cluster-queue-name=cluster-queue'
Note
If you encounter an error like “error: Metrics API not available”, it may be caused by certificate verification problems in your cluster. You can skip the verification in metrics-server by editing the deploymentkubectl edit deployment metrics-server -n kube-system
and adding --kubelet-insecure-tls to the container arguments, however this
is highly discouraged in production environments.
Production resource monitoring
- Install prometheus and kube-state-metrics according to your cluster’s setup. Next, configure your
kube-state-metricsto allowlist Kueue pod labels.
The commands below assume you are using the kube-prometheus-stack Helm chart (which includes both tools). If you use a different setup, adjust your configuration accordingly. To proceed with the Helm chart, add the following fragment to your values.yaml file:
kube-state-metrics:
metricLabelsAllowlist:
- pods=[kueue.x-k8s.io/cluster-queue-name, kueue.x-k8s.io/local-queue-name]
Deploy prometheus stack helm chart with values.yaml:
helm install kube-prometheus-stack oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack -f values.yaml
- If you do not have the Prometheus UI available, forward a port for the Prometheus service:
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090
-
Deploy some jobs, such as the sample jobs defined earlier.
-
Verify that the new labels are available by running a PromQL query in the Prometheus UI. (If you are not using port-forwarding, use your specific cluster address instead.)
kube_pod_labels{label_kueue_x_k8s_io_local_queue_name!=""}
- You can now join the queue labels with your existing pod resource metrics using group_left. For example, use this query to aggregate CPU usage by local queue:
sum by (label_kueue_x_k8s_io_local_queue_name) (
sum by (namespace, pod) (
rate(container_cpu_usage_seconds_total{container!="", pod!=""}[5m])
)
* on(namespace, pod) group_left(label_kueue_x_k8s_io_local_queue_name)
kube_pod_labels{label_kueue_x_k8s_io_local_queue_name!=""}
)
To filter or aggregate by cluster queue instead, replace the local queue label with label_kueue_x_k8s_io_cluster_queue_name.
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.