Handling OOMKills in Kubernetes
Starting in Kubernetes 1.28 (more specifically, when the nodes use cgroups v2), the behavior of the Linux OOMKiller changed in a way that broke many workloads.
- previously: if a process was OOMKilled, only that process died; the rest of the pod’s process tree was left running
- now: if a process is OOMKilled, every process in the pod’s process tree is killed by the OOMKiller
The change in question configured the OOMKiller to kill the entire cgroup (in practice, the container’s whole process tree) rather than just the individual process.
Why did this break so many workloads (see the discussions on the PR)? Many application designs use child processes to do the heavy lifting for resource-heavy processing. If a child process was killed by the OOMKiller, the parent process could be written to clean up (and possibly retry!). With this change, the entire application gets killed. Not just terminated (SIGTERM, which gives the application time to clean up) but killed (SIGKILL, which halts execution immediately and gives the process no time to shut down gracefully), potentially leaving the application in an unclean state.
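You can check whether this group-kill behavior is active for a container by reading the cgroup’s memory.oom.group file. A minimal sketch, assuming cgroups v2 with the unified hierarchy mounted at /sys/fs/cgroup inside the container (the usual default):
from pathlib import Path

# on cgroups v2, memory.oom.group=1 means the kernel OOMKiller kills the
# whole cgroup (i.e. the container's process tree) instead of one process
oom_group = Path("/sys/fs/cgroup/memory.oom.group")
if oom_group.exists():
    print("group OOM kill enabled:", oom_group.read_text().strip() == "1")
else:
    print("memory.oom.group not found (cgroups v1, or controller not enabled)")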
As an example of how this would be managed in Python before the change, let’s take a look at this code:
import logging
import subprocess

logger = logging.getLogger(__name__)

# let's assume ls leads to an oomkill
p = subprocess.Popen(["ls", "-lah"])
p.wait()

# a negative returncode means the child was terminated by that signal;
# -9 corresponds to SIGKILL, which is what the OOMKiller sends
if p.returncode == -9:
    logger.error("process killed by oom")
    raise Exception("child process got oomkilled")
A child process killed by the OOMKiller shows up with a returncode of -9 (the negative of SIGKILL’s signal number), allowing this code to handle the situation gracefully.
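To make the -9 convention concrete, here is a small self-contained sketch that simulates the kill (using sleep as a stand-in for the real child) and checks the returncode against signal.SIGKILL instead of a hard-coded -9:
import signal
import subprocess

p = subprocess.Popen(["sleep", "60"])
p.send_signal(signal.SIGKILL)  # simulate what the OOMKiller does
p.wait()

# a negative returncode means "terminated by that signal number"
assert -p.returncode == signal.SIGKILL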
Kubernetes offers no way to change this behavior in versions 1.28 through 1.31. In 1.32, a kubelet-level flag is available that restores the original behavior. That said, the (now default) behavior does make sense, especially since most workloads are unlikely to handle OOMKills correctly. What we need is a way for applications that know what they’re doing to catch a process about to run out of memory before it triggers the OOMKiller.
Handling OOMKills by avoiding the OOMKiller
When looking at different ways to achieve this goal, I came across the ulimit utility in Linux, which lets a process set soft and hard limits on its usage of various resources. The underlying system call is setrlimit(2), which is also exposed by language-level libraries (e.g. Python’s resource module).
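As a quick illustration of the soft/hard limit mechanics, here is a minimal sketch using Python’s resource module (with RLIMIT_AS, discussed below, as the example resource; the 2 GiB value is arbitrary):
import resource

# query the current soft and hard limits on this process's address space
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print(soft, hard)  # resource.RLIM_INFINITY means "no limit"

# lower the soft limit; an unprivileged process can lower its hard limit
# but can never raise it again afterwards
two_gib = 2 * 1024**3
resource.setrlimit(resource.RLIMIT_AS, (two_gib, hard))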
The interesting options for this syscall are (from the man page):
RLIMIT_AS
The maximum size of the process’s virtual memory (address space) in bytes. This limit affects calls to brk(2), mmap(2) and mremap(2), which fail with the error ENOMEM upon exceeding this limit. Also automatic stack expansion will fail (and generate a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.
RLIMIT_DATA
The maximum size of the process’s data segment (initialized data, uninitialized data, and heap). This limit affects calls to brk(2) and sbrk(2), which fail with the error ENOMEM upon encountering the soft limit of this resource.
RLIMIT_RSS
Specifies the limit (in pages) of the process’s resident set (the number of virtual pages resident in RAM). This limit only has effect in Linux 2.4.x, x < 30, and there only affects calls to madvise(2) specifying MADV_WILLNEED.
I looked at each of these options, and RLIMIT_AS seemed like the best fit: RLIMIT_RSS only has an effect on ancient 2.4-era kernels, and RLIMIT_DATA (per the man page text above) only constrains brk(2)/sbrk(2), so allocations made via mmap(2) slip past it. RLIMIT_AS caps the process’s entire virtual address space, so oversized allocations fail with ENOMEM regardless of how the memory is requested.
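To see what hitting RLIMIT_AS looks like from Python’s side, here is a small sketch (the 1 GiB cap and 2 GiB allocation are arbitrary values, and it assumes the interpreter’s existing address-space usage is already below the cap): the failing allocation surfaces as a MemoryError rather than a visit from the OOMKiller.
import resource

ONE_GIB = 1024**3

# cap this process's virtual address space at ~1 GiB
resource.setrlimit(resource.RLIMIT_AS, (ONE_GIB, ONE_GIB))

try:
    buf = bytearray(2 * ONE_GIB)  # exceeds the cap, fails with ENOMEM
except MemoryError:
    print("allocation failed cleanly instead of triggering the OOMKiller")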
Example: Using setrlimit in Python
I tested this specifically in Python using the resource module. The methods in Python’s subprocess module for executing code in another process accept a function (preexec_fn) that runs in the child just after it is spawned (and before it execs), giving us the perfect place to inject these resource constraints.
import logging
import resource
import subprocess

logger = logging.getLogger(__name__)

LIMIT_1G = 1_000_000_000  # approx 1GB

def preexec_fn():
    # runs in the child after fork, before exec: cap its address space
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_1G, LIMIT_1G))

# let's assume ls leads to an oomkill
p = subprocess.Popen(
    ["ls", "-lah"],
    preexec_fn=preexec_fn,
)
p.wait()

# we would need to wrap whatever python process we are running so that it
# exits with a unique code when it runs into ENOMEM, which is raised as a
# `MemoryError` exception in python. Let's say that code is 225.
if p.returncode == 225:
    logger.error(f"process ran out of memory: {LIMIT_1G}")
    raise Exception("child process ran out of memory")
Using a version of this for the app I was working on prevented OOMKills entirely, allowing us to gracefully detect and handle when the child process ran out of memory.