How to Efficiently Map Kubernetes Pods to Selected Nodes
For a general code deployment in a Kubernetes-driven environment, we often do not care which nodes the pods are deployed on. After all, kube-scheduler does a good job of selecting a node that has sufficient computing resources (memory, CPU, storage, or network) and that ranks highest in the node-selection process. But what if we need to control that placement ourselves? This article looks at the ways to achieve this.
Ways to handle pod/node mapping
A common use case is improving resource utilization by preventing light pods, say the ones with no GPU requirement, from running on costly GPU-equipped nodes. To implement this mechanism, we can use the following strategies:
- Taints and Tolerations:
A taint is a tag containing a key/value with an effect (e.g., NoExecute, NoSchedule, etc.) applied on a node. Only pods that carry a matching tag of their own, i.e. a toleration, can run on a tainted node. We can taint a costly GPU node or node group so that CPU-intensive pods without the toleration cannot run there. The caveat is that a toleration only permits a pod to run on a tainted node; it does not guarantee that GPU-intensive pods run only on GPU nodes. They could still land on CPU nodes, block other pods from executing, and lead to resource under-utilization (with GPU nodes remaining idle). We can test this strategy on a local minikube instance, where there may not be any GPU allocated, but tainting the node is still possible. Any pod without a matching toleration will fail to schedule on minikube and remain in the Pending state indefinitely.
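To make the matching rule concrete, here is a minimal Python sketch of how a toleration is compared against a taint. It is a simplified model of the scheduler's behavior, not its actual code. The key/value pair `hardware=gpu` and the function names are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical taint on a GPU node. With kubectl it could be applied as:
#   kubectl taint nodes <node-name> hardware=gpu:NoSchedule
GPU_TAINT = {"key": "hardware", "value": "gpu", "effect": "NoSchedule"}

# Matching toleration that a GPU pod would carry under spec.tolerations.
GPU_TOLERATION = {
    "key": "hardware",
    "operator": "Equal",
    "value": "gpu",
    "effect": "NoSchedule",
}


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Return True if the toleration matches the taint (simplified rules)."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (used with operator Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    return operator == "Equal" and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations: List[Dict[str, str]],
                 node_taints: List[Dict[str, str]]) -> bool:
    """A pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in blocking
    )
```

Under this model, `can_schedule([GPU_TOLERATION], [])` is also true: a GPU pod with the toleration is still free to land on an untainted CPU node, which is the caveat described above.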
- Node Affinity:
Node affinity lets us require, or merely prefer, that a pod run only on a given set of nodes. GPU-intensive pods whose affinity rule matches the node's label can then be scheduled on that node. But even here, nothing stops a user from launching a notebook server with a mismatched pod/node mapping!
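As a rough illustration, the sketch below shows a required node-affinity rule in the shape it takes under a pod's `spec.affinity`, along with a simplified matcher. The node label `hardware=gpu` and the function names are assumptions, and the `Gt`/`Lt` operators are left out for brevity.

```python
from typing import Any, Dict

# Hard ("required") affinity rule: only nodes labeled hardware=gpu qualify.
# The label key and value are hypothetical.
GPU_AFFINITY: Dict[str, Any] = {
    "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {"key": "hardware", "operator": "In", "values": ["gpu"]}
                    ]
                }
            ]
        }
    }
}


def _expression_matches(expr: Dict[str, Any], labels: Dict[str, str]) -> bool:
    """Evaluate one matchExpressions entry against the node's labels."""
    key, operator = expr["key"], expr["operator"]
    if operator == "In":
        return labels.get(key) in expr["values"]
    if operator == "NotIn":
        return labels.get(key) not in expr["values"]
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {operator}")


def matches_required_affinity(affinity: Dict[str, Any],
                              node_labels: Dict[str, str]) -> bool:
    """Return True if the node satisfies the pod's required node affinity."""
    required = affinity.get("nodeAffinity", {}).get(
        "requiredDuringSchedulingIgnoredDuringExecution"
    )
    if not required:
        return True  # no hard rule, so any node is acceptable
    # Terms are ORed together; expressions inside one term are ANDed.
    for term in required["nodeSelectorTerms"]:
        expressions = term.get("matchExpressions", [])
        if expressions and all(_expression_matches(e, node_labels) for e in expressions):
            return True
    return False
```

Note that a pod with no affinity at all passes this check for every node, including GPU nodes. That is why affinity is typically paired with a taint on the expensive nodes.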
- Mutating Webhook:
We deployed the above mechanisms on a client's Kubeflow platform (Kubeflow is a machine-learning toolkit that runs on Kubernetes). Kubeflow provides an option to launch a notebook server with prompts to choose a toleration and an affinity. The issue is that a user can still launch a server without filling in those fields. Her server pod will most probably not get scheduled on a node, since an affinity is in place, but this creates more confusion. It would be better to map the pods to nodes in a way that requires minimal user intervention. This is where dynamic admission controllers, i.e. webhooks, come in. A validating webhook can accept or reject the user request to access a node, but cannot alter the request. This is a simple way to reject all requests for GPU nodes coming from CPU-intensive pods; however, the user still needs to submit another request to get the pods scheduled. In contrast, a mutating webhook can modify a user request through a patch that contains the pod/node mapping. The setup then directs each pod to the correct node, GPU pods to the GPU nodes and CPU pods to the CPU nodes, even if the user initially selected the wrong toleration.
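The patching step can be sketched as follows. This is a minimal illustration, not the webhook we deployed. It assumes GPU pods request the extended resource `nvidia.com/gpu`, that nodes are labeled `hardware=gpu` or `hardware=cpu`, and that GPU nodes carry the taint `hardware=gpu:NoSchedule`. The helper names are hypothetical, and the HTTPS server that would receive the AdmissionReview request is omitted.

```python
import base64
import json
from typing import Any, Dict, List

GPU_RESOURCE = "nvidia.com/gpu"  # assumed extended-resource name for GPUs
LABEL_KEY = "hardware"           # hypothetical node label / taint key


def wants_gpu(pod: Dict[str, Any]) -> bool:
    """Return True if any container sets a non-zero GPU limit."""
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if str(limits.get(GPU_RESOURCE, "0")) not in ("0", ""):
            return True
    return False


def build_patch(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build JSON Patch operations that pin the pod to the right node pool."""
    pool = "gpu" if wants_gpu(pod) else "cpu"
    # Simplification: this overwrites any affinity the user selected.
    affinity = {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {"matchExpressions": [
                        {"key": LABEL_KEY, "operator": "In", "values": [pool]}
                    ]}
                ]
            }
        }
    }
    # Keep unrelated tolerations, drop whatever the user picked for our key.
    existing = pod.get("spec", {}).get("tolerations") or []
    tolerations = [t for t in existing if t.get("key") != LABEL_KEY]
    if pool == "gpu":
        tolerations.append({
            "key": LABEL_KEY, "operator": "Equal",
            "value": "gpu", "effect": "NoSchedule",
        })
    # JSON Patch "add" on an object member creates it or replaces its value.
    return [
        {"op": "add", "path": "/spec/affinity", "value": affinity},
        {"op": "add", "path": "/spec/tolerations", "value": tolerations},
    ]


def admission_response(review: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap the patch in an AdmissionReview response (admission.k8s.io/v1)."""
    request = review["request"]
    patch = build_patch(request["object"])
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            # The API server expects the patch as base64-encoded JSON.
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

A validating webhook would return the same envelope with `"allowed": False` and no patch, which is why the user would then have to resubmit the request.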
Achieving optimal execution on a limited pool of resources is a daunting task. If the nodes are severely underutilized, system admins may decommission them. However, if the workload varies unpredictably (especially during the beginning or exploration phase of a project), reducing the resource pool may not always be feasible, since it can lead to process starvation. The above approaches work best when the admins can estimate the workload reasonably well. In that case, a mutating webhook can provide an efficient way to utilize expensive system resources.