If you’ve ever deployed to Kubernetes and seen users briefly hit a blank page or a 502 Bad Gateway error, you’re not alone. I ran into this exact issue on a production GKE cluster running a web frontend with multiple replicas behind a Google Cloud Load Balancer.

The symptom: during a rolling update, for a few seconds, some users would see a black screen with a “failed upstream” message. Then it would resolve on its own. Annoying, intermittent, and hard to reproduce locally.

The root cause is not one race condition but several, layered on top of each other. A preStop hook is the first line of defense, but it’s not the whole story.

How pod termination works

When Kubernetes decides to terminate a pod, two things start happening in parallel:

  1. The container gets SIGTERM so it can shut down.
  2. The pod’s endpoint gets removed from Services and load balancers so traffic stops being routed to it.

These two paths don’t coordinate. Without a preStop hook, the timeline looks roughly like this:

t=0   Pod marked Terminating
      ├── SIGTERM sent immediately
      └── Endpoint removal starts (async)
              ├── In-cluster Service rules update
              └── Cloud LB stops routing (slowest)

If SIGTERM lands and the app starts shutting down before the cloud LB has finished removing the endpoint, the LB keeps sending traffic to a pod that’s already closing connections — and that’s where the 502s come from.

Why the timing is fragile

There are actually three deregistration paths in flight when a pod terminates, each with its own latency:

  1. In-cluster Service routing. The Service endpoint update propagates to every node within a few seconds on a modern cluster — usually fast enough that this isn’t the binding constraint.

  2. Cloud LB endpoint deregistration. The slowest of the three. The cloud LB has to be told the endpoint is gone via an external API call, and that round-trip can take tens of seconds. On GKE with container-native load balancing, Google documents up to ~60s of NEG-removal latency.

  3. The pod’s own shutdown. Once SIGTERM fires, the container closes listeners and refuses new work — completely unaware of the other two clocks.

The 502s you see are the gap between path 3 (fast) and path 2 (slow). The fix is to give the slowest path enough time to win the race.

The fix: preStop hooks

A preStop hook runs before SIGTERM is sent. The kubelet won’t dispatch SIGTERM until the hook completes (or the grace period expires). That gives you a controlled delay where the pod is still alive and listening, but is already in Terminating state — so the deregistration paths above can run to completion.

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 10"]

This changes the termination timeline to:

t=0    Kubernetes marks pod as Terminating
       ├── preStop hook starts (sleep 10)
       └── Pod removal from Endpoints begins (async)
               ├── kube-proxy updates iptables
               └── Cloud LB deregisters the endpoint (slowest)

t=10   preStop hook completes
       ├── SIGTERM sent to container process
       └── Application begins graceful shutdown

By the time SIGTERM actually reaches your application, in-cluster routing has caught up and, provided the sleep is long enough, the load balancer has stopped sending traffic to it too.
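
To make "long enough" concrete, here is a rough model of the race. It assumes a deliberately simplified world: the app stops serving the instant SIGTERM arrives, and the load balancer stops routing a fixed number of seconds after the pod is marked Terminating. The function name and the 25-second latency are illustrative assumptions, not measurements.

```python
def error_window_seconds(prestop_sleep_s: float, lb_dereg_latency_s: float) -> float:
    """Seconds during which the LB may still route to a pod that has stopped serving.

    Simplified model: SIGTERM arrives when the preStop sleep ends, the app stops
    serving immediately, and the LB stops routing lb_dereg_latency_s seconds
    after the pod is marked Terminating.
    """
    return max(0.0, lb_dereg_latency_s - prestop_sleep_s)


if __name__ == "__main__":
    # Hypothetical LB deregistration latency of 25 seconds.
    for sleep_s in (0, 10, 30):
        window = error_window_seconds(sleep_s, lb_dereg_latency_s=25)
        print(f"preStop sleep {sleep_s:>2}s -> up to {window:.0f}s of possible 502s")
```

The window only closes once the sleep is at least as long as the slowest deregistration path, which is why the sleep has to be sized against the load balancer rather than against the app.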

A note on distroless / shell-less images

The ["/bin/sh", "-c", "sleep 10"] form silently fails on distroless images that have no shell — which I learned the hard way after migrating a frontend to a Docker Hardened Image. The kubelet tries to exec sh, gets ENOENT, emits a FailedPreStopHook event (worth grepping for in kubectl describe pod), and sends SIGTERM immediately. Same blip as before, just stealthier.

Two fixes:

  1. Use a binary that’s actually in the image. Whatever runtime your image ships — pick something that can sleep on its own and call it directly via exec.
  2. Use the native sleep lifecycle action, which doesn’t need any binary in the container at all. It was introduced by KEP-3960 (alpha in 1.29, beta in 1.30, GA in 1.34):

lifecycle:
  preStop:
    sleep:
      seconds: 10

Connection draining: the missing piece

A preStop sleep delays SIGTERM, but it doesn’t change what the GCLB does with in-flight connections when it finally decides to stop using the endpoint. By default on a GCP Application LB (which is what fronts Gateway / Ingress for HTTP workloads), Google ends existing connections “as quickly as possible” once an endpoint is removed — meaning a request mid-response can be cut off.

The fix is connectionDraining.drainingTimeoutSec on the backend service. With a draining timeout set, when the load balancer removes an endpoint it stops sending new requests, but lets in-flight ones complete (up to the timeout) before tearing the connection down. The default is 0, meaning no draining grace at all (GCP’s connection draining docs cover this).

How you configure it depends on your GKE networking choice:

  • Classic Ingress: BackendConfig.spec.connectionDraining.drainingTimeoutSec.
  • Gateway API: GCPBackendPolicy.spec.default.connectionDraining.drainingTimeoutSec (GCP’s Gateway resources guide).

A common starting point is 30 seconds — long enough to let typical responses finish, short enough that a deployment doesn’t take forever. The right value isn’t independent of preStop: Google’s NEG guidance recommends preStop ≥ drainingTimeoutSec + ~60s NEG-removal latency, so picking a draining timeout commits you to a matching preStop floor (more on that math below).

apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: frontend-backend-policy
spec:
  default:
    connectionDraining:
      drainingTimeoutSec: 30
  targetRef:
    group: ""
    kind: Service
    name: frontend

The full picture

The preStop hook alone isn’t enough if your other settings don’t support it. A reasonable starting configuration looks like this:

spec:
  terminationGracePeriodSeconds: 120
  containers:
    - name: frontend
      lifecycle:
        preStop:
          sleep:
            seconds: 60
      startupProbe:
        httpGet:
          path: /api/healthz
          port: http
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 12
      readinessProbe:
        httpGet:
          path: /api/healthz
          port: http
        periodSeconds: 10
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /api/healthz
          port: http
        periodSeconds: 30
        failureThreshold: 3

Let me explain each piece:

terminationGracePeriodSeconds — The total budget Kubernetes gives the pod to shut down. It must be longer than your preStop sleep + the time your application needs to drain connections. For NEG-backed services on GKE, Google’s guidance goes up to 3.5 minutes (210 seconds) paired with a 120-second preStop. The picking-the-numbers section below works through how to land on a value that matches your traffic.

startupProbe: This is the counterpart to preStop, but for pod startup. Liveness and readiness probes are held off until the startup probe succeeds. Without it, the liveness probe starts checking right away, and a slow-starting application can fail it and get restarted before it ever becomes ready. The startup probe gives the app roughly 60 seconds (12 failures × 5s period) to initialize before the liveness and readiness probes kick in.

readinessProbe on /api/healthz — Using a dedicated health endpoint instead of the root / path is important. The root path might render a full page, hit the database, or return a large response. A health endpoint should be a lightweight check that returns 200 when the app is ready to serve traffic. When the readiness probe fails, Kubernetes removes the pod from the Service endpoints — this is the in-cluster equivalent of what the load balancer does externally.

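livenessProbe: This restarts the container if the application becomes unresponsive. Keep it more tolerant than the readiness probe. In the example, the longer period (30s vs 10s) with the same failureThreshold of 3 means about 90 seconds of consecutive failures before a restart, versus about 30 seconds before the pod is pulled from the Service endpoints. You don’t want to restart containers aggressively, just catch actual deadlocks.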
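
The probe numbers above are easier to compare side by side. This is a small sketch using the usual rule of thumb (periodSeconds × failureThreshold); it ignores timeoutSeconds and scheduling jitter, so treat the results as rough bounds. The Probe class is a hypothetical helper, not a Kubernetes API type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    period_seconds: int
    failure_threshold: int

    def failure_window_seconds(self) -> int:
        """Approximate seconds of consecutive failures before the probe trips.

        Rule of thumb only: ignores timeoutSeconds and scheduling jitter.
        """
        return self.period_seconds * self.failure_threshold


# Values copied from the example pod spec above.
startup = Probe(period_seconds=5, failure_threshold=12)
readiness = Probe(period_seconds=10, failure_threshold=3)
liveness = Probe(period_seconds=30, failure_threshold=3)

if __name__ == "__main__":
    print("startup budget:         ", startup.failure_window_seconds(), "s")
    print("readiness -> NotReady:  ", readiness.failure_window_seconds(), "s")
    print("liveness -> restart:    ", liveness.failure_window_seconds(), "s")
```

Keeping the liveness window well above the readiness window means a struggling pod is taken out of rotation long before it is restarted.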

Application-level graceful shutdown

Even with a perfect preStop and connection-draining setup, your application still has to shut down cleanly. Two general principles, regardless of language or framework:

  1. Install a SIGTERM handler that begins a drain, not a tear-down. Plenty of runtimes and web frameworks default to terminating the process the moment SIGTERM arrives, killing in-flight requests with it. Replace that default with a handler that stops accepting new connections, lets active ones finish (with a timeout), and only then exits.

  2. Treat idle and active connections separately during the drain. Idle keep-alive sockets should be closed immediately so clients reconnect against a healthy pod; active in-flight requests should be allowed to finish. Mixing the two — for example, slamming all connections shut at once — produces the same RST-induced 502s the LB-layer fixes were supposed to eliminate. The exact mechanics are framework-specific, but every serious web runtime exposes a way to do this.

A typical shutdown sequence looks like:

  1. SIGTERM arrives.
  2. Flip an in-memory shuttingDown = true flag.
  3. /api/healthz starts returning 503.
  4. Stop accepting new connections.
  5. Wait for active requests to drain (with a timeout).
  6. Exit.

A note on step 3: with periodSeconds: 10, failureThreshold: 3 (the example values above), the readiness probe takes up to 30 seconds to mark the pod NotReady — which on a Kubernetes-initiated termination is roughly the same speed as the natural endpoint removal that’s already happening, so flipping /api/healthz doesn’t really beat it. Where the trick does help is app-self-initiated drains (a scheduled in-process restart, a feature flag flip) where there’s no pod termination event for k8s to react to.

Steps 4–5 — explicit close-and-drain logic — are the part that actually closes the connection-reset race, and that’s where graceful-shutdown libraries earn their keep.
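
Here is a minimal sketch of that sequence using only the Python standard library. It is not what my frontend runs, just an illustration. The port, DRAIN_TIMEOUT_S, and the DrainingServer class are assumptions made up for the example. The stdlib handler speaks HTTP/1.0 by default and closes each connection after its response, so there are no idle keep-alive sockets to deal with here; a real framework will need its own handling for those.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DRAIN_TIMEOUT_S = 25  # hypothetical; keep it below the time left after preStop

shutting_down = threading.Event()  # the in-memory "shuttingDown" flag


def health_status() -> int:
    """Status code for /api/healthz: 503 once a drain has started."""
    return 503 if shutting_down.is_set() else 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = health_status() if self.path == "/api/healthz" else 200
        body = b"ok\n" if status == 200 else b"draining\n"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


class DrainingServer(ThreadingHTTPServer):
    """Threaded server that counts accepted connections still being handled."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._in_flight = 0
        self._cond = threading.Condition()

    def process_request(self, request, client_address):
        with self._cond:  # count on the accept thread, before the worker starts
            self._in_flight += 1
        super().process_request(request, client_address)

    def process_request_thread(self, request, client_address):
        try:
            super().process_request_thread(request, client_address)
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def wait_idle(self, timeout: float) -> bool:
        """Block until no requests are in flight; False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)


def begin_drain(server: DrainingServer) -> None:
    shutting_down.set()  # steps 2-3: /api/healthz now answers 503
    # shutdown() blocks until serve_forever() returns, so run it on another thread.
    threading.Thread(target=server.shutdown, daemon=True).start()


def main() -> None:
    server = DrainingServer(("0.0.0.0", 8080), Handler)
    signal.signal(signal.SIGTERM, lambda signum, frame: begin_drain(server))
    server.serve_forever()              # returns once begin_drain() has run
    server.server_close()               # step 4: close the listening socket
    server.wait_idle(DRAIN_TIMEOUT_S)   # step 5: bounded wait for active requests
    # step 6: returning from main() lets the process exit
```

Call main() from the container entrypoint. The design choice worth noting is that in-flight work is counted on the accept thread, before the worker thread starts, so a connection accepted just before the drain cannot slip past the final wait.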

Rolling update strategy

The deployment strategy matters too:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

maxUnavailable: 0 means Kubernetes must bring up a new pod and wait for it to pass readiness checks before terminating an old one. This guarantees that during a rollout, the total number of ready pods never drops below the desired count. Combined with the preStop hook, it means:

  1. New pod starts, passes startup probe, then readiness probe
  2. Old pod marked for termination
  3. preStop sleep gives the network time to deregister the old pod
  4. SIGTERM sent, application drains gracefully
  5. No moment where traffic has nowhere to go

GKE-specific: HealthCheckPolicy

If you’re on GKE using the Gateway API, the cloud load balancer runs its own health checks independently of Kubernetes probes. By default, it might check the root path / or use settings that don’t match your readiness probe. You can align them with a HealthCheckPolicy:

apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: frontend-healthcheck
spec:
  default:
    checkIntervalSec: 10
    timeoutSec: 3
    healthyThreshold: 1
    unhealthyThreshold: 3
    config:
      type: HTTP
      httpHealthCheck:
        portSpecification: USE_SERVING_PORT
        requestPath: /api/healthz
  targetRef:
    group: ""
    kind: Service
    name: frontend

This ensures the GKE load balancer checks the same /api/healthz endpoint your Kubernetes probes use, with matching intervals. Without this, you can have a situation where Kubernetes thinks the pod is ready but the load balancer hasn’t marked it healthy yet (or vice versa).

Picking the sleep duration

There’s an actual formula here, not just intuition. Google’s NEG guidance says:

preStop ≥ drainingTimeoutSec + ~60s NEG-removal latency

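drainingTimeoutSec is the connection-draining timeout you configured on the BackendConfig or GCPBackendPolicy. The ~60s NEG-removal latency is Google’s stated worst case for endpoint removal propagating to the LB. With the 30s draining timeout used in this post, the floor is a 90s preStop. Google’s headline numbers are a 120s preStop and a 210s terminationGracePeriodSeconds: under the formula, a 120s preStop covers a draining timeout of up to 60s, and the 90s between the two numbers is headroom for the application’s own shutdown.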
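
The formula is simple enough to keep next to your manifests as a sanity check. This is a sketch with hypothetical function names; the 60-second default is the NEG-removal figure quoted above, and app_shutdown_s is whatever you budget for your own application's drain.

```python
NEG_REMOVAL_LATENCY_S = 60  # the ~60s worst-case figure quoted above


def prestop_floor_seconds(draining_timeout_s: int,
                          neg_latency_s: int = NEG_REMOVAL_LATENCY_S) -> int:
    """Minimum preStop sleep: preStop >= drainingTimeoutSec + NEG-removal latency."""
    return draining_timeout_s + neg_latency_s


def grace_period_seconds(prestop_s: int, app_shutdown_s: int) -> int:
    """terminationGracePeriodSeconds must cover the preStop hook plus app shutdown."""
    return prestop_s + app_shutdown_s


if __name__ == "__main__":
    for drain_s in (30, 60):
        floor = prestop_floor_seconds(drain_s)
        print(f"drain {drain_s}s -> preStop floor {floor}s")
    # Matches the example pod spec: 60s preStop + 60s for the app = 120s.
    print("grace period:", grace_period_seconds(prestop_s=60, app_shutdown_s=60), "s")
```

Running the numbers this way also makes it obvious when a chosen preStop sits below the floor for the draining timeout you picked, which is a conscious tradeoff rather than an accident.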

A 10-second sleep is a common starting point and gets rid of the obvious blips, but it sits well below the formula floor — under load, occasional sub-second blips will still show up, because NEG removal can take much longer than 10s. Anywhere in the 60–120s preStop / 30s drain / 120–210s grace range is a defensible setup; the right point on that range is a deployment-time-vs-tail-latency tradeoff, and you can move along it without changing anything else.

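That deployment-time cost is real wherever pods are replaced strictly one at a time and each step waits for the old pod to finish terminating (a StatefulSet, for example): with 10 replicas and a 120-second preStop, that adds 20 minutes. A Deployment does not count Terminating pods toward its replica totals, so old pods can sit in their preStop sleep in parallel and the added wall-clock time is usually smaller, at the cost of extra pods running side by side during the rollout. In practice you tune this against your replica count and tolerance for occasional retries.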
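
For the strictly sequential case, the arithmetic looks like this. It is a worst-case sketch that assumes every replacement step waits out a full preStop sleep before the next one starts; the function name and the max_surge batching are my own simplification.

```python
import math


def sequential_rollout_overhead_s(replicas: int, prestop_s: int, max_surge: int = 1) -> int:
    """Extra seconds added if every replacement step waits out a full preStop sleep."""
    steps = math.ceil(replicas / max_surge)
    return steps * prestop_s


if __name__ == "__main__":
    overhead = sequential_rollout_overhead_s(replicas=10, prestop_s=120)
    print(f"{overhead}s = {overhead / 60:.0f} minutes")
```

Raising the batch size is the obvious lever: under the same assumptions, replacing two pods per step halves the added time.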

Summary

The 502-during-deployment problem isn’t one race — it’s three: in-cluster Service routing, the cloud LB’s async endpoint deregistration, and the application’s own shutdown. The fix is to make sure each one finishes before the next stage begins:

  1. preStop sleep — delays SIGTERM so the network has time to catch up. Use the native sleep action (GA in k8s 1.34) to avoid shell-binary issues with distroless images.
  2. terminationGracePeriodSeconds — must be > preStop sleep + drain time. For NEG-backed GKE, Google recommends up to 210s.
  3. Connection draining via BackendConfig / GCPBackendPolicy — lets in-flight requests complete after the LB marks the endpoint gone. Default is 0 (no grace); set 30s and pair with preStop ≥ drainingTimeoutSec + ~60s.
  4. maxUnavailable: 0 — ensures new pods are ready before old ones terminate.
  5. Dedicated health endpoint — consistent checks across probes and load balancer.
  6. HealthCheckPolicy (GKE) — aligns cloud LB health checks with Kubernetes probes.
  7. Application-level SIGTERM handling: flip readiness to 503, close idle connections, let in-flight requests finish, then exit. The whole infra stack can be perfect and still produce 502s if the application drops connections abruptly on SIGTERM.

A one-line preStop hook gets you most of the way. The remaining few percent — the blips that survive long enough to show up in error budgets — come from the cloud-LB and application layers, not the kubelet.