What is Kubernetes Autoscaling?
Kubernetes lets you automate many management tasks, including provisioning and autoscaling. Instead of manually allocating resources, you can create automated processes that save time, let you respond quickly to peaks in demand, and conserve costs by scaling down when resources are not needed.
There are three Kubernetes autoscaling mechanisms:
- Horizontal Pod Autoscaler (HPA) – referred to as “scaling out,” dynamically increases or decreases the number of running pods as your application’s usage changes.
- Vertical Pod Autoscaler (VPA) – also referred to as “scaling up,” adjusts the resources (such as CPU or memory) allocated to existing pods.
- Cluster Autoscaler – while horizontal and vertical pod autoscaling (HPA and VPA) handle scaling at the application level, Cluster Autoscaler handles autoscaling at the infrastructure level by automatically increasing and decreasing the number of nodes available in the cluster for your pods to run on. (A node in Kubernetes is a physical or virtual machine.)
Horizontal Pod Autoscaler (HPA)
Horizontal scaling means that the response to increased load is to deploy more pods. This is different from vertical scaling, which for Kubernetes would mean assigning more resources (for example: memory or CPU) to the Pods that are already running for the workload.
When the level of application usage changes, the HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.
If the load decreases, and the number of Pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource (the Deployment or StatefulSet) to scale back down.
Horizontal pod autoscaling does not apply to objects that can’t be scaled, for example a DaemonSet.
Kubernetes horizontal pod autoscaling is not a continuous process but runs at scheduled intervals, with a default interval of 15 seconds. The interval is set by the `--horizontal-pod-autoscaler-sync-period` flag on the kube-controller-manager.
During each interval, resource utilization is queried and compared against the metrics specified in the HPA definition, such as per-pod resource metrics (like CPU), object metrics, and external metrics. A common method is to configure HPA to fetch metrics from aggregated APIs provided by an add-on Metrics Server, which must be deployed separately.
The HPA controller compares actual resource utilization to the metrics defined for each HPA, and increases or decreases the number of replicas as needed to maintain the defined service level metrics.
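A minimal HorizontalPodAutoscaler manifest illustrates this configuration. The Deployment name `web-app` and the 70% CPU target are hypothetical placeholders; adjust them for your workload:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:            # the workload resource to scale
    apiVersion: apps/v1
    kind: Deployment
    name: web-app            # hypothetical Deployment name
  minReplicas: 2             # never scale below this
  maxReplicas: 10            # never scale above this
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # target average CPU utilization across pods
```

With this in place, the HPA controller adds replicas when average CPU utilization across the pods exceeds 70%, and removes them (down to `minReplicas`) when it drops below.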
If possible, try to prioritize custom metrics over external metrics. The external metrics API is a security risk because it can provide access to a large number of metrics, whereas a custom metrics API that holds just the specific metrics you need poses a much lower risk if it becomes compromised.
You should also consider using HPA together with Cluster Autoscaler (see below). In this way, scalability of pods can be coordinated with the nodes in the cluster. So for example, when you need to scale up, the Cluster Autoscaler can add eligible nodes, and when scaling down, it can shut down unneeded nodes to conserve resources.
Vertical Pod Autoscaling (VPA)
The Vertical Pod Autoscaler (VPA) increases or decreases the CPU and memory reservations allocated to pods. In this way, VPA can free up resources for other pods to serve users. VPA configuration can define which pods are eligible for vertical scaling, and when scaling should be applied.
VPA helps to size pods for the optimal CPU and memory resources required by the pod. VPA can be configured to provide recommended values for CPU and memory requests and limits, or to automatically update the values.
It’s important to set the correct resource requests and limits for your workloads for stability and cost efficiency. If the pod resource allocations are less than required by the workloads, the application can be throttled or it can fail completely due to out-of-memory errors. If resource sizes are too large, then you will have waste and increased costs.
VPA can both down-scale pods that are over-requesting resources, and also up-scale pods that are under-requesting resources based on their usage over time.
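A basic VerticalPodAutoscaler manifest might look like the following sketch. The target Deployment name `web-app` and the resource bounds are hypothetical; the VPA custom resource is provided by the separately installed VPA add-on:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app              # hypothetical Deployment name
  updatePolicy:
    updateMode: "Auto"         # "Off" only records recommendations without applying them
  resourcePolicy:
    containerPolicies:
    - containerName: "*"       # applies to all containers in the pod
      minAllowed:              # floor for recommendations
        cpu: 100m
        memory: 128Mi
      maxAllowed:              # ceiling for recommendations
        cpu: "2"
        memory: 2Gi
```

Setting `updateMode: "Off"` is a common first step: it lets the Recommender gather usage data and publish suggested requests without the Updater evicting any pods.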
Vertical Pod Autoscaling (VPA) consists of three components:
- Recommender – monitors the current and past resource usage and recommends CPU and memory requests values for the container.
- Updater – checks which of the managed pods have correct resources set and kills those that do not so that they can be recreated with updated values.
- Admission Plugin – sets the correct resource requests on new pods that have just been created or recreated by the Updater.
Vertical Pod Autoscaling (VPA) provides these benefits:
- Cluster nodes are used efficiently because pods use exactly what they need.
- Pods are scheduled onto nodes that have the appropriate resources available.
- No need to run time-consuming benchmarking tasks to determine the correct values for CPU and memory requests.
- Maintenance time is reduced because VPA can automatically adjust CPU and memory requests over time.
Vertical Pod Autoscaling Limitations
Updating running pods is an experimental feature of VPA. Whenever VPA updates a pod’s resources, the pod is recreated, which restarts all of its running containers; the pod may also be recreated on a different node.
VPA reacts to most out-of-memory events, but not in all situations, and VPA performance has not been tested in large clusters.
VPA should not currently be used together with the Horizontal Pod Autoscaler (HPA) on CPU or memory metrics for the same set of pods. However, you can combine VPA with an HPA that is configured to use custom or external metrics.
VPA recommendations may occasionally exceed the available resources which can cause pods to go into a pending status. This behaviour can be partly mitigated by using VPA together with Cluster Autoscaler (see below) to spin up new nodes if required.
Cluster Autoscaler for Kubernetes Infrastructure Autoscaling
Kubernetes’s native horizontal and vertical pod autoscaling (HPA and VPA) handle scaling at the application level. However, when it comes to the infrastructure layer, Kubernetes does not carry out infrastructure scaling itself. The Cluster Autoscaler can help with scaling your container clusters.
Cluster Autoscaler is an open-source project that automatically scales a Kubernetes cluster based on the scheduling status of pods and the resource utilization of nodes. If pods are unscheduled because of insufficient resources, Cluster Autoscaler will automatically add more nodes to the cluster using your cloud provider’s auto scaling capabilities, for example Auto Scaling Groups (ASGs) and Spot Fleet on AWS, or similar services on Microsoft Azure and Google Cloud.
Cluster Autoscaler is constantly looking for pods that cannot be scheduled, while also trying to consolidate pods that are currently deployed on only a few nodes.
Despite this simple approach to auto scaling, configuring Cluster Autoscaler for optimal use can be complex. Users need to have a good understanding of their pods and container needs, and need to be aware of some of the limitations of Cluster Autoscaler.
Unschedulable pods are usually the result of inadequate memory or CPU resources, or an inability to match an existing node because of node taints the pod does not tolerate (taints repel pods that lack a matching toleration), affinity rules (rules attracting a pod to specific nodes), or nodeSelector labels. If a cluster contains unschedulable pods, the Cluster Autoscaler checks managed node pools to see whether adding a node would unblock the pod. If so, and the node pool can be enlarged, it adds a node to the pool.
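As an illustration of these scheduling constraints, the hypothetical pod below can only land on nodes carrying an `accelerator: nvidia-gpu` label and tolerates a `gpu` taint; if no such node with 4 spare CPUs exists, the pod stays Pending and becomes a trigger for the Cluster Autoscaler:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-task                 # hypothetical pod name
spec:
  nodeSelector:
    accelerator: nvidia-gpu      # hypothetical node label; restricts eligible nodes
  tolerations:
  - key: "gpu"                   # allows scheduling onto nodes tainted gpu:NoSchedule
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: worker
    image: busybox
    resources:
      requests:
        cpu: "4"                 # large request; may exceed free capacity on existing nodes
        memory: 8Gi
```

The Cluster Autoscaler will only add a node for this pod if one of its managed node pools can produce a node that satisfies the label, taint, and resource requirements.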
Cluster Autoscaler also scans a managed pool’s nodes for potential rescheduling of pods on other available cluster nodes.
To ensure resource availability, set a CPU request of at least one full CPU on the Cluster Autoscaler pod. This is critical to ensure that the node running the Cluster Autoscaler pod has enough resources; otherwise, the Cluster Autoscaler could become unresponsive.
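In the Cluster Autoscaler Deployment, that guidance corresponds to a container spec fragment like the following sketch (the image tag and memory request are illustrative values, not recommendations from this article):

```yaml
# Fragment of the cluster-autoscaler Deployment's pod template
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # example tag; pin to your cluster version
  resources:
    requests:
      cpu: "1"          # at least one full CPU, per the guidance above
      memory: 600Mi     # illustrative value
```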
Other Kubernetes Scaling Mechanisms
There are other methods you can use to scale workloads in Kubernetes. Here are two common methods:
- DaemonSets – used to run a copy of a pod on every node (or on a selected set of nodes), typically for background services such as log collectors or monitoring agents.
- ReplicaSets – used to create a specified quantity of identical pods.
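For example, a minimal ReplicaSet keeping three identical pods running could look like this (the `frontend` name and nginx image are placeholders; in practice you would usually create a Deployment, which manages ReplicaSets for you):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend               # hypothetical name
spec:
  replicas: 3                  # desired number of identical pods
  selector:
    matchLabels:
      app: frontend            # must match the pod template labels below
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: nginx:1.25
```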