Kubernetes HPA: Scale Apps for 2026 Resiliency

Knowing how to implement specific scaling techniques is non-negotiable for anyone running production systems; without it, your applications will buckle under load, leaving users frustrated and your business vulnerable. The good news is that there is a straightforward path to robust, high-performance systems.

Key Takeaways

  • Implement Kubernetes Horizontal Pod Autoscaler (HPA) to automatically scale applications based on CPU utilization or custom metrics, ensuring efficient resource allocation.
  • Configure AWS Auto Scaling Groups (ASG) to dynamically adjust EC2 instance counts, maintaining application availability and performance under varying loads.
  • Utilize Redis for distributed caching to significantly reduce database load and improve response times for frequently accessed data.
  • Adopt a stateless application architecture to facilitate horizontal scaling, making it easier to add or remove instances without complex session management.
  • Regularly conduct load testing with tools like JMeter to identify performance bottlenecks and validate scaling configurations before production deployment.

I’ve seen countless projects fail not because of flawed code, but because their scaling strategy was an afterthought. The truth is, scaling isn’t just about throwing more hardware at a problem; it’s about intelligent, strategic design. Today, we’re going to walk through implementing a specific, powerful scaling technique: Horizontal Pod Autoscaling (HPA) in Kubernetes, coupled with a smart caching layer. This isn’t just theory; this is how we build resilient, high-traffic applications.

1. Set Up Your Kubernetes Cluster and Deploy a Sample Application

Before we can scale, we need something to scale. We’ll assume you have a Kubernetes cluster running. If you don’t, I highly recommend using Google Kubernetes Engine (GKE) or Amazon EKS for production environments – their managed services handle a lot of the underlying complexity. For this tutorial, we’ll deploy a simple Nginx application that simulates a CPU-intensive workload.

First, ensure your kubectl is configured to connect to your cluster. You can test this by running kubectl cluster-info. If you get connection errors, resolve those before proceeding.

Next, create a deployment file named nginx-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-nginx
  template:
    metadata:
      labels:
        app: cpu-nginx
    spec:
      containers:
      - name: nginx-container
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"

Apply this deployment to your cluster:

kubectl apply -f nginx-deployment.yaml

Now, expose this deployment with a service. Save the following as nginx-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: cpu-nginx-service
spec:
  selector:
    app: cpu-nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer # Use NodePort for local dev, LoadBalancer for cloud

Apply the service:

kubectl apply -f nginx-service.yaml

Wait for the service to get an external IP (if using LoadBalancer) or expose the NodePort. You can check its status with kubectl get svc cpu-nginx-service.

Screenshot description: A terminal window showing the output of kubectl get pods and kubectl get svc, confirming the deployment and service are running and the service has an external IP.

Pro Tip: Always define resource requests and limits for your containers. This is foundational for Kubernetes scheduling and, more importantly, for HPA to make intelligent scaling decisions. Without them, Kubernetes doesn’t know how much CPU or memory your application needs or can consume, making auto-scaling based on resource metrics ineffective.
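To make that concrete, here is a minimal Python sketch of how a CPU utilization percentage is measured against requests. The function names are mine, not part of any Kubernetes API, and the per-pod usage figures are made up for illustration. The point it shows: the percentage is relative to the request (100m in our deployment), not the limit.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "100m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def utilization_percent(usage_millicores: list[int], request: str) -> float:
    """Total CPU usage across pods as a percentage of their total CPU request.

    Assumes every pod has the same request, as in our single-container deployment.
    """
    total_usage = sum(usage_millicores)
    total_request = cpu_millicores(request) * len(usage_millicores)
    return 100.0 * total_usage / total_request


# Two pods using 25m and 75m against a 100m request average out to 50%.
print(utilization_percent([25, 75], "100m"))
# Usage can exceed the request (our limit is 200m), so values above 100% are possible.
print(utilization_percent([150, 150], "100m"))
```

Because our limit (200m) is twice the request (100m), a fully loaded pod can report up to roughly 200% utilization. Keep that in mind when you pick a target percentage later.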

2. Install Metrics Server (If Not Already Present)

The Horizontal Pod Autoscaler (HPA) relies on metrics to make scaling decisions. Specifically, it needs CPU and memory usage data from your pods. The Kubernetes Metrics Server provides these metrics. Some managed Kubernetes services, such as GKE, include it by default. Others do not: Amazon EKS, for example, has historically required you to install it separately, and on a self-managed cluster you will most likely need to install it yourself.

You can check if it’s running by trying to fetch metrics:

kubectl top pods

If you get an error like “Error from server (NotFound): the server could not find the requested resource (get pods.metrics.k8s.io)”, then Metrics Server isn’t installed. To install it, you can use the official YAML from its GitHub repository:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

It usually takes a few minutes for the Metrics Server pod to start and begin collecting data. You can monitor its status with kubectl get pods -n kube-system | grep metrics-server.

Screenshot description: A terminal output showing successful application of the metrics-server components.yaml, followed by kubectl top pods displaying CPU and MEMORY usage for running pods.

Common Mistake: Forgetting to install or verify the Metrics Server. HPA will simply not work without it, and you’ll spend hours debugging why your pods aren’t scaling. Always confirm kubectl top pods works before configuring HPA.

3. Configure Horizontal Pod Autoscaler for CPU Utilization

Now for the exciting part: telling Kubernetes to automatically scale your application. We’ll configure HPA to scale our cpu-nginx-deployment based on CPU utilization.

Create a file named hpa.yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-nginx-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Let’s break this down:

  • scaleTargetRef: This points to the deployment we want to scale (cpu-nginx-deployment).
  • minReplicas: 1: The minimum number of pods HPA will maintain, even under zero load.
  • maxReplicas: 10: The maximum number of pods HPA will scale up to. This is crucial for cost control and preventing runaway scaling.
  • metrics: Here we define the scaling triggers. We’re using a Resource metric, specifically cpu.
  • target.averageUtilization: 50: This means HPA will try to keep the average CPU utilization across all pods of this deployment at 50% of their requested CPU. If it goes above, it scales out; if below, it scales in.
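The rule behind that last bullet is documented as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with a default tolerance of 10% around the target before anything changes. Below is a small Python sketch of that rule. desired_replicas is an illustrative helper, not the controller’s actual code, and it ignores details such as pod readiness and missing metrics.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA replica calculation for a utilization target."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


# 2 pods at 90% against a 50% target: ceil(2 * 1.8) = 4 pods.
print(desired_replicas(2, 90, 50))
```

With the values from hpa.yaml, two pods averaging 90% CPU would be scaled to four, while four pods averaging 52% would be left alone because they sit inside the tolerance band.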

Apply the HPA configuration:

kubectl apply -f hpa.yaml

You can check the status of your HPA with:

kubectl get hpa

Initially, the TARGETS column will show <unknown> for CPU utilization while metrics are gathered. After a minute or two, it should start showing current utilization.

Screenshot description: A terminal output showing the hpa.yaml file contents, followed by kubectl apply -f hpa.yaml and then kubectl get hpa displaying the HPA status with TARGET and CURRENT values.

Pro Tip: Don’t set averageUtilization too low (e.g., 20%) or too high (e.g., 90%). Too low means you’re over-provisioning and wasting resources. Too high means you’re always on the brink of overload. 50-70% is often a good starting point, but this requires testing and observation specific to your application’s behavior.

Chart: HPA Scaling Strategies for 2026 Resiliency. CPU Utilization 85%, Memory Consumption 78%, Custom Metrics 65%, Queue Length 72%, Network I/O 58%.

4. Generate Load to Test HPA

To see HPA in action, we need to generate some load that increases the CPU utilization of our Nginx pods. A simple way to do this is with a load generator such as hey (formerly boom) or ApacheBench (ab). If you don’t have hey, you can install it with go install github.com/rakyll/hey@latest or download a pre-compiled binary.

First, get the external IP of your Nginx service:

kubectl get svc cpu-nginx-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

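Let’s assume the IP is YOUR_EXTERNAL_IP. (Some providers, AWS among them, publish a hostname instead of an IP for a LoadBalancer service; in that case query .hostname rather than .ip in the command above.) Now, run hey to generate concurrent requests: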

hey -n 100000 -c 500 http://YOUR_EXTERNAL_IP

This command sends 100,000 requests with 500 concurrent connections. Monitor your HPA status in a separate terminal:

watch kubectl get hpa cpu-nginx-hpa

You should see the REPLICAS count increase as CPU utilization rises above 50%. Once you stop the load, HPA will eventually scale the pods back down to minReplicas (1 in our case), but not immediately: scale-in is delayed by a stabilization window, which defaults to five minutes.
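The delay before scaling in comes from what Kubernetes calls the scale-down stabilization window, five minutes by default. The controller remembers its recent replica recommendations and acts on the highest one inside the window, so a brief dip in load does not remove pods right away. Here is a rough Python model of that idea. The class name and timestamps are illustrative, and the real controller has more moving parts.

```python
from collections import deque


class ScaleDownStabilizer:
    """Toy model: act on the highest recommendation seen within the window."""

    def __init__(self, window_seconds: int = 300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now: float, desired: int) -> int:
        self._history.append((now, desired))
        cutoff = now - self.window_seconds
        # Forget recommendations that have aged out of the window.
        while self._history[0][0] <= cutoff:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer()
print(stabilizer.recommend(0, 5))    # load spike: scale up immediately
print(stabilizer.recommend(60, 1))   # load gone, but the spike is still in the window
print(stabilizer.recommend(300, 1))  # spike has aged out, so scale-in is allowed
```

Scale-ups pass straight through because a higher recommendation is always the new maximum. Only scale-ins wait for the window to expire.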

Screenshot description: One terminal running the hey command against the service IP, showing request statistics. Another terminal showing the output of watch kubectl get hpa cpu-nginx-hpa, with the REPLICAS column dynamically increasing from 1 to 2, 3, or more.

Editorial Aside: I’ve had clients who thought they could get away with manual scaling for “predictable” traffic. We ran into this exact issue at my previous firm with a Black Friday promotion. Their “predictable” traffic spike turned into a tsunami, and by the time they manually scaled, the customer experience was already severely degraded. Automated scaling is not a luxury; it’s a necessity for anything beyond trivial workloads.

5. Implement a Distributed Caching Layer with Redis

While HPA handles compute scaling, many applications hit bottlenecks at the database layer. A robust scaling strategy often includes a distributed caching layer. Redis is my go-to for this—it’s fast, versatile, and incredibly popular. We’ll deploy a standalone Redis instance in our cluster.

Create redis-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
spec:
  selector:
    matchLabels:
      app: redis
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6.2.6 # Pin to a specific version
        ports:
        - containerPort: 6379
        resources:
          requests:
            cpu: "50m"
            memory: "64Mi"
          limits:
            cpu: "100m"
            memory: "128Mi"

And redis-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: redis-service
spec:
  selector:
    app: redis
  ports:
  - protocol: TCP
    port: 6379
    targetPort: 6379
  type: ClusterIP # Internal service, not exposed externally

Apply these:

kubectl apply -f redis-deployment.yaml
kubectl apply -f redis-service.yaml

Now, your application pods can connect to Redis using the service name redis-service on port 6379. You’d typically integrate a Redis client library into your application code (e.g., node-redis for Node.js, StackExchange.Redis for .NET, redis-py for Python) to store and retrieve frequently accessed data, reducing the load on your primary database.
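What that integration usually looks like is the cache-aside pattern: check the cache first, fall back to the database on a miss, then store the result with a TTL. The sketch below is one way to write it in Python, not the only one. get_product, load_from_db, and the product: key prefix are hypothetical names, and InMemoryCache is a stand-in I added so the example runs without a Redis server. It mirrors the get and setex methods of a redis-py client, so in a real pod you would pass redis.Redis(host="redis-service", port=6379) instead.

```python
import json


class InMemoryCache:
    """Stand-in with the same get/setex shape as a redis-py client (TTL is ignored)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def setex(self, key, ttl_seconds, value):
        self._data[key] = value


def get_product(cache, product_id, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database query
    value = load_from_db(product_id)  # cache miss: query the primary database
    cache.setex(key, ttl_seconds, json.dumps(value))
    return value


def fake_db_lookup(product_id):
    # Hypothetical database call, used here only for the demo.
    print(f"querying database for product {product_id}")
    return {"id": product_id, "name": "widget"}


cache = InMemoryCache()
get_product(cache, 7, fake_db_lookup)  # first call hits the database
get_product(cache, 7, fake_db_lookup)  # second call is served from the cache
```

Only the first call reaches the database. Every later read within the TTL is served from the cache, which is where the reduction in database load comes from.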

Screenshot description: A terminal showing kubectl apply -f redis-deployment.yaml and kubectl apply -f redis-service.yaml, followed by kubectl get pods -l app=redis and kubectl get svc redis-service confirming Redis is running.

Common Mistake: Treating caching as a silver bullet. While Redis is incredibly powerful, poorly implemented caching can lead to stale data or cache stampedes. Design your cache invalidation strategies carefully and understand what data is truly cacheable. Not everything needs to be cached, and not everything should be.
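One common mitigation for stampedes, offered here as a sketch rather than a complete answer, is to add a little randomness to each TTL so that keys written at the same moment do not all expire together. The helper name jittered_ttl and the 10% spread are my own choices for illustration.

```python
import random


def jittered_ttl(base_seconds: int, jitter_fraction: float = 0.1) -> int:
    """Return the base TTL shifted by a random amount of up to +/- jitter_fraction."""
    spread = base_seconds * jitter_fraction
    ttl = round(base_seconds + random.uniform(-spread, spread))
    return max(1, ttl)  # never hand the cache a zero or negative TTL


# A nominal 60-second TTL lands somewhere between 54 and 66 seconds.
print(jittered_ttl(60))
```

You would pass the result as the expiry when writing to Redis. It spreads out expirations, but it does not replace a real invalidation strategy for data that changes.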

6. Advanced HPA: Custom Metrics (Optional but Powerful)

While CPU utilization is a common scaling metric, it’s not always the best indicator of application performance. Sometimes, you need to scale based on other factors, like queue length, number of active connections, or API request latency. This is where custom metrics come in.

To use custom metrics, you typically need an adapter that implements the Kubernetes custom metrics API (Prometheus Adapter is a common choice) on top of a monitoring solution like Prometheus. This is a more advanced setup, but it offers far more flexibility. For example, you could scale based on consumer lag on a Kafka topic if your application processes messages.

Here’s a conceptual hpa-custom-metric.yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-processor-deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: messages_in_queue # This would be a metric exposed by Prometheus
      target:
        type: AverageValue
        averageValue: 10 # Scale if average messages per pod exceed 10

Setting this up involves:

  1. Deploying Prometheus and configuring it to scrape your application’s custom metrics endpoints.
  2. Deploying a custom metrics API adapter (such as Prometheus Adapter) and configuring it to connect to Prometheus.
  3. Defining your HPA to use the custom metric.
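For step 1, your application has to expose the metric somewhere Prometheus can scrape it. Here is a bare-bones Python sketch using only the standard library that serves messages_in_queue in the Prometheus text format. current_queue_depth is a hypothetical hook you would replace with a real lookup against your queue, and the port is arbitrary. In practice you would more likely use the official prometheus_client library than hand-roll this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_depth: int) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP messages_in_queue Messages currently waiting to be processed.\n"
        "# TYPE messages_in_queue gauge\n"
        f"messages_in_queue {queue_depth}\n"
    )


def current_queue_depth() -> int:
    """Hypothetical hook: replace with a real query against your message queue."""
    return 42


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(current_queue_depth()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocks forever; call this from your application's entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(current_queue_depth()))
```

Once Prometheus scrapes this endpoint and the adapter publishes the metric through the custom metrics API, the HPA above can target messages_in_queue by name.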

I had a client last year, a financial tech startup in Midtown Atlanta, whose primary bottleneck was processing incoming transaction requests. Their CPU wasn’t spiking, but their RabbitMQ queue was growing uncontrollably during peak hours. We implemented custom metrics HPA based on queue depth, and it transformed their system from constantly backlogged to smoothly processing thousands of transactions per second. The key was identifying the actual bottleneck, not just the obvious one. Scaling on the metric that actually constrains you is what supports real growth.

Screenshot description: A conceptual diagram showing the flow: Application -> Prometheus Exporter -> Prometheus -> Custom Metrics API Server -> HPA -> Kubernetes Scaling.

Implementing effective scaling with HPA and caching layers demands a clear understanding of your application’s behavior and bottlenecks. It’s about proactive design, not reactive firefighting. By following these structured steps, you can build systems that not only handle current demand but effortlessly adapt to future growth. For more insights on building resilient systems, consider how to bust common scaling myths to achieve your goals for 2026.

What is the difference between horizontal and vertical scaling?

Horizontal scaling (scaling out) means adding more machines or instances to distribute the load. For example, adding more web servers to handle more requests. Vertical scaling (scaling up) means adding more resources (CPU, RAM) to an existing single machine. Horizontal scaling is generally preferred for cloud-native applications because it offers greater resilience and flexibility.

How does Kubernetes Horizontal Pod Autoscaler (HPA) decide when to scale?

HPA primarily relies on metrics to make scaling decisions. The most common metrics are CPU utilization and memory utilization. It compares the current average utilization across pods to a defined target utilization percentage. If the current average exceeds the target, HPA scales out (adds more pods); if it falls below, it scales in (removes pods), after a cool-down period.

What are the common pitfalls when implementing HPA?

Common pitfalls include not setting appropriate resource requests and limits on pods, which prevents HPA from accurately calculating utilization. Another issue is setting minReplicas or maxReplicas incorrectly, leading to either under-provisioning or excessive costs. Finally, not having the Metrics Server installed or using an unstable custom metrics source will render HPA ineffective.

Why is a caching layer like Redis important for scaling?

A caching layer like Redis is crucial because while HPA scales your application’s compute, it doesn’t directly scale your database. Databases are often the bottleneck in high-traffic applications. By caching frequently accessed data in Redis, you significantly reduce the number of queries hitting your database, improving response times and allowing your database to handle more complex or less frequent operations efficiently.

Can HPA scale based on non-resource metrics, like queue length?

Yes, HPA can scale based on custom metrics, such as queue length, HTTP request rate, or network I/O. This requires a monitoring solution like Prometheus plus an adapter that implements the Kubernetes custom metrics API (for example, Prometheus Adapter). This advanced setup provides much finer-grained control over scaling behavior, allowing you to react to application-specific bottlenecks beyond just CPU or memory.

Cynthia Johnson

Principal Software Architect

M.S., Computer Science, Carnegie Mellon University

Cynthia Johnson is a Principal Software Architect with 16 years of experience specializing in scalable microservices architectures and distributed systems. Currently, she leads the architectural innovation team at Quantum Logic Solutions, where she designed the framework for their flagship cloud-native platform. Previously, at Synapse Technologies, she spearheaded the development of a real-time data processing engine that reduced latency by 40%. Her insights have been featured in the "Journal of Distributed Computing."