I added Horizontal Pod Autoscalers to a production GKE cluster running Flux CD for GitOps and Linkerd as a service mesh. The HPA configuration itself was straightforward — the problems came from how these three systems interact.
Here are three pitfalls I hit, each requiring a separate fix.
1. Flux keeps resetting the replica count
The first HPA I deployed looked like this:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 5
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Within minutes, the HPA scaled the frontend from 5 to 7 replicas based on CPU load. Then Flux ran its next reconciliation cycle, saw that the Deployment manifest in Git said replicas: 5, and reset it back. The HPA immediately scaled up again. Flux reset it again. A tug-of-war.
The problem: both Flux and the HPA think they own the spec.replicas field on the Deployment. Flux enforces the Git state, and the HPA enforces the metric-based target. They conflict.
The fix has two parts.
First, remove the replicas field from the Deployment manifest entirely — both from the base and any overlay patches:
```yaml
# Before
spec:
  replicas: 5  # remove this
  template:
    ...
```
Second, tell Flux never to apply a replicas value itself, so it doesn't reconcile the field back. Note that Flux's Kustomization resource has no ignoreDifferences field (that is Argo CD syntax); the Flux-native approach is a kustomize patch that strips the field before Flux applies the manifest:

```yaml
spec:
  patches:
    - target:
        kind: Deployment
        namespace: podarujdobro
      patch: |-
        - op: remove
          path: /spec/replicas
```
With both changes, the HPA is the sole owner of the replica count. Flux still manages everything else in the Deployment spec.
2. Linkerd sidecar breaks the CPU metric
After fixing the Flux issue, I noticed the HPA was calculating CPU utilization incorrectly. It was reporting much lower values than what kubectl top pods showed for the actual application container.
The HPA was configured with type: Resource, which calculates CPU utilization as:
utilization = (total pod CPU usage) / (total pod CPU requests)
The key word is pod. That includes all containers — not just the application container. Linkerd injects a linkerd-proxy sidecar into every pod, and that sidecar didn’t have a CPU request set. When a container has no CPU request, the HPA’s metric calculation breaks — it either ignores that container’s usage or computes the ratio incorrectly.
The result: the HPA thought CPU utilization was lower than it actually was and didn’t scale up when it should have.
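The dilution is easy to see with concrete numbers. Here is a small Python sketch of both calculations; the millicore figures are hypothetical, not measurements from my cluster, and they assume for illustration that the sidecar does carry a request. (With no request at all, as in my case, the pod-level ratio is not even well defined, which is exactly the failure mode.)

```python
# Hypothetical millicore figures, for illustration only.
def utilization_pct(usages, requests):
    """HPA utilization math: total usage over total requests, as a percentage."""
    return 100 * sum(usages) / sum(requests)

app_usage, app_request = 450, 500      # app container: busy
proxy_usage, proxy_request = 25, 250   # sidecar: mostly idle

# type: Resource sums every container in the pod
pod_level = utilization_pct([app_usage, proxy_usage], [app_request, proxy_request])

# type: ContainerResource looks only at the named container
container_level = utilization_pct([app_usage], [app_request])

print(f"pod-level:       {pod_level:.1f}%")        # 63.3%
print(f"container-level: {container_level:.1f}%")  # 90.0%
```

The idle sidecar drags the pod-level number down to 63%, comfortably under a 70% target, while the app container is actually running at 90%.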
The fix: switch from Resource to ContainerResource metrics, targeting the specific application container:
```yaml
metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: frontend  # target only the app container
      target:
        type: Utilization
        averageUtilization: 70
  - type: ContainerResource
    containerResource:
      name: memory
      container: frontend
      target:
        type: Utilization
        averageUtilization: 80
```
The ContainerResource metric type was introduced in Kubernetes 1.20. Instead of summing all container resources in the pod, it looks at a single named container. This makes the HPA oblivious to sidecars — it measures exactly the container you care about.
This applies to any multi-container pod where sidecars don’t have resource requests. If you’re running Linkerd, Istio, or any other service mesh with auto-injected sidecars, use ContainerResource instead of Resource for your HPA metrics.
3. PodDisruptionBudget blocks scale-down
When I added PodDisruptionBudgets, I initially set them with minAvailable:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend
spec:
  minAvailable: 6
  selector:
    matchLabels:
      app: podarujdobro
      component: frontend
```
This seemed reasonable — always keep at least 6 pods running during voluntary disruptions (node drains, rolling updates). But it created a problem with the HPA.
When the HPA wanted to scale down from, say, 8 replicas to 5, the PDB blocked it. The PDB said “you must keep at least 6 pods available,” so Kubernetes refused to terminate pods below that threshold. The HPA’s desired state and the PDB’s constraint were in direct conflict.
The fix: switch from minAvailable to maxUnavailable:
```yaml
spec:
  maxUnavailable: 1
```
maxUnavailable: 1 means “during a voluntary disruption, at most 1 pod can be unavailable at a time.” This protects against too many pods going down simultaneously (the original goal) without imposing a floor on the total replica count. The HPA can scale down to any number, and the PDB only constrains how many pods can be disrupted at once during that process.
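The eviction arithmetic behind the two policies can be sketched in a few lines of Python. This is my simplified model of how the disruption controller computes disruptionsAllowed (it ignores unhealthy pods for clarity):

```python
def allowed_min_available(healthy, min_available):
    # minAvailable imposes a floor on the pod count itself
    return max(0, healthy - min_available)

def allowed_max_unavailable(healthy, desired, max_unavailable):
    # maxUnavailable only bounds how many pods may be down at once
    return max(0, max_unavailable - (desired - healthy))

# minAvailable: 6 -- fine at 8 replicas, deadlocked at 6
print(allowed_min_available(healthy=8, min_available=6))  # 2
print(allowed_min_available(healthy=6, min_available=6))  # 0

# maxUnavailable: 1 -- one eviction allowed at any replica count
print(allowed_max_unavailable(healthy=5, desired=5, max_unavailable=1))  # 1
print(allowed_max_unavailable(healthy=3, desired=3, max_unavailable=1))  # 1
```

The second form keeps its budget relative to the current replica count, so nothing stops the deployment from shrinking.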
The final configuration
After all three fixes, here’s what the HPA and PDB look like:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: ContainerResource
      containerResource:
        name: cpu
        container: frontend
        target:
          type: Utilization
          averageUtilization: 70
    - type: ContainerResource
      containerResource:
        name: memory
        container: frontend
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: podarujdobro
      component: frontend
```
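The behavior block deserves a word. Scale-down stabilization makes the HPA act on the highest recommendation seen during the 300-second window, and the Pods policy then limits each step to one pod per 60 seconds. A rough Python model of one scale-down decision (my simplified reading of the documented algorithm, not the controller's actual code):

```python
def next_replicas(current, window_recommendations, max_removed_per_period=1):
    """One scale-down decision under stabilization plus a Pods policy."""
    # stabilization: honor the most conservative (highest) recent recommendation
    stabilized = max(window_recommendations)
    if stabilized >= current:
        return current  # no scale-down warranted yet
    # policy {type: Pods, value: 1}: remove at most one pod per period
    return max(stabilized, current - max_removed_per_period)

# traffic dropped; recommendations over the window were 7, 6, 5
print(next_replicas(8, [7, 6, 5]))  # 7: one pod at a time, toward the window max
# a single spike inside the window holds the line entirely
print(next_replicas(8, [8, 5, 5]))  # 8
```

The asymmetric windows (300s down, 30s up) mean the HPA reacts quickly to load but sheds capacity slowly, which is the usual bias for user-facing traffic.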
And the Flux Kustomization includes a patch removing /spec/replicas from all Deployments in the namespace.
Each of these pitfalls took a separate commit to diagnose and fix. None of them showed up in local testing; they only manifested when Flux, the HPA, and Linkerd were all running together in a real cluster with actual traffic.