TIL about container lifecycle hooks
in Kubernetes, and more specifically how preStop
hooks can be used to avoid downtime during deployments.
Somewhat simplified, when new pods are rolled out during a deployment, Kubernetes tries to balance the number of available pods by shutting down and starting up one replica at a time. A gotcha in this process is that the component responsible for managing the pod lifecycle is independent of the component responsible for routing traffic to pods. In practice, this means that if a pod shuts down before the routing has been updated, requests can still be routed to pods that have already been killed. This is likely to result in timeouts, 503s, and the like.
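For reference, the pace at which replicas are replaced is controlled by the Deployment's update strategy. A minimal sketch of a conservative, one-at-a-time configuration (the values here are illustrative; the defaults for both fields are 25%):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never drop below the desired replica count
      maxSurge: 1        # bring up at most one extra pod at a time

Note that even a conservative rollout like this does nothing about the routing race described above.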
To alleviate this, one common approach is to
add a shutdown delay to your application. Here, a signal handler is added that catches SIGTERM
and delays the normal shutdown procedure by some amount of time (e.g. 5 seconds). This enables the application to keep responding to new requests
until the ingress controller has had time to deregister the pod. After the configured delay, the normal shutdown procedure is initiated,
rejecting new requests and completing in-flight requests before shutting down.
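As an illustration only, here's a minimal sketch of what such a shutdown delay might look like in a Go HTTP server; the port, the 5-second delay, and the 25-second drain timeout are assumptions, not something from the original setup:

package main

import (
    "context"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8080"}

    go func() {
        // Catch SIGTERM instead of letting it terminate the process immediately.
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGTERM)
        <-sigs

        // Keep serving new requests while the ingress controller deregisters the pod.
        time.Sleep(5 * time.Second)

        // Then stop accepting new requests and finish in-flight ones.
        ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
        defer cancel()
        srv.Shutdown(ctx)
    }()

    srv.ListenAndServe()
}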
It turns out that in Kubernetes there's a simpler approach. In your Deployment
(or wherever you specify the container configuration)
you can set a preStop
lifecycle hook that runs before the shutdown signal is sent to the container. There you can
simply wait for some duration before continuing:
lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]
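For context, a sketch of where this sits in a full Deployment manifest (the name, image, and port are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.0.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]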
If you're running Kubernetes v1.33 or greater, or have the PodLifecycleSleepAction
feature gate enabled, there's an equivalent hook handler
implementation that lets you avoid having to include the sleep
binary in your container image:
lifecycle:
  preStop:
    sleep:
      seconds: 5
This is not a foolproof solution, but it's a common enough thing to do that Kubernetes decided to add a hook handler specifically for it. One problem
with it is that you can't know exactly how long the ingress controller will take to update its registrations. There might be a more complicated way
around this that waits until the pod has been deregistered by checking the Kubernetes API, but I haven't looked further into it. Another thing to be aware
of is that the runtime of the preStop
hook counts toward the pod's shutdown grace period. This means that if your hook runs sleep 30
while your terminationGracePeriodSeconds
remains at its default value of 30, your pod will have 0 seconds left to perform its shutdown procedure and will
be forcefully killed by Kubernetes.
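As a rough sketch of how to account for this (the numbers are just examples), the grace period can be bumped so the sleep doesn't eat into the time the application needs for its own shutdown:

spec:
  terminationGracePeriodSeconds: 35  # 5s preStop sleep + up to 30s for the app's own shutdown
  containers:
    - name: my-app
      image: my-app:1.0.0
      lifecycle:
        preStop:
          sleep:
            seconds: 5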