Interpreting common exit codes in Kubernetes

In this article, we will delve into Kubernetes exit codes 127 and 137, explaining what they are, common causes in K8s and Docker, and how to fix them!

We will cover:

  1. History of exit codes
  2. Exit code 127
  3. Exit code 137

The History of Exit Codes

The history of process exit codes can be traced back to the early days of the Unix operating system. In Unix systems, a process exit code is an integer value passed to its parent process upon termination, used to indicate the termination status of the process. This integer value typically falls between 0 and 255, where 0 indicates successful termination, and other values are usually used to represent different errors or exceptional conditions.

Process exit codes were initially designed to provide a simple mechanism for parent processes to understand the outcome of their child processes’ execution. This allows parent processes to take appropriate actions based on the exit code of the child process, such as handling error conditions or continuing with other operations.

In Unix systems, specific exit code values often have specific meanings, such as:

  • 0: Indicates successful execution without errors.
  • 1: Typically signifies a general error.
  • 2: Indicates a syntax error in the command.
  • 127: Indicates command not found.

Over time, with the development of Unix operating systems and different implementations, the meanings of process exit codes may vary, but the basic concept remains unchanged.

In Linux systems, the usage of process exit codes is similar to Unix systems. Linux inherits Unix’s process management mechanism and extends and improves upon it. Therefore, process exit codes remain an important concept in Linux, used to aid in understanding and diagnosing the execution status of processes.

In short, process exit codes date back to the early days of Unix and remain an important concept in both Unix and Linux, providing a simple yet effective signaling mechanism between processes. When an application or command is terminated by a fatal signal, it produces an exit code in the 128 series (128 + n), where n is the signal number (for example, SIGTERM is 15 and SIGKILL is 9).

Exit Code 127

Exit code 127 is not a Kubernetes-specific error code, but rather a standard exit code used in Linux and similar Unix-like operating systems. However, it is frequently encountered in Kubernetes and typically indicates that a command or binary executed within a container could not be found.

Some standard exit codes include:

  • 0: Command executed successfully
  • 1: General error
  • 2: Misuse of shell builtins
  • 126: Permission denied; the invoked command cannot execute
  • 127: Command not found, or incorrect PATH
  • 128+n: Command terminated by signal n (fatal error)
  • >255: Exit codes above 255 are wrapped around (taken modulo 256)
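
To see a few of these codes in practice, you can run some quick commands in a local shell and inspect $? (a generic bash/sh sketch, not Kubernetes-specific):

true; echo $?                              # 0: success
false; echo $?                             # 1: general error
no-such-command; echo $?                   # 127: command not found
sleep 60 & kill -9 $!; wait $!; echo $?    # 137 = 128 + 9 (SIGKILL)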

Let’s take a look at some common reasons for exit code 127:

  1. Command or binary not installed

    The executable specified in the command field of a Kubernetes container is not installed in the container’s file system. Ensure that the required binary or command is available.

  2. Incorrect path or command

    The command specified in the Pod definition is incorrect or does not exist at the specified path. This is one of the most common errors, often caused by a typo in the Dockerfile or in the Pod spec’s entrypoint or command (see the illustrative Pod spec after this list).

  3. Missing dependencies

    The application or script running inside the container lacks necessary dependencies. Ensure that all required dependencies are included in the container image.

  4. Shell interpreter

    If a script is specified as the command, ensure that it begins with a valid interpreter line (shebang), e.g., #!/bin/bash, and that the interpreter is available in the container.

  5. Shell script syntax error

    If the shell script exits with code 127, check if there are syntax errors in the script or issues preventing its execution.

  6. Insufficient permissions

    The user running the command inside the container may not have the necessary permissions to execute the specified command. Ensure that the container runs with appropriate privileges.

  7. Image compatibility issues

    Ensure that the container image used is compatible with the host architecture and operating system. Mismatched images may result in commands not being found, such as running an x86 image on an ARM machine.

  8. Volume mounts

    If the command relies on files mounted from a volume, check if the volume mount is configured correctly and the required files are accessible.

  9. Environment variables

    Some commands may depend on specific environment variables. Ensure that the necessary environment variables are set correctly.

  10. Kubernetes RBAC policies

    If RBAC is enabled, ensure that the necessary permissions are granted to execute the specified command.
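
As a concrete illustration, the following minimal Pod spec (hypothetical names) terminates with exit code 127 because the shell cannot find the misspelled command:

apiVersion: v1
kind: Pod
metadata:
  name: exit-127-demo                          # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: busybox
    command: ["/bin/sh", "-c", "ech hello"]    # typo: "echo" misspelled, so /bin/sh exits with 127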

How to Troubleshoot

To diagnose the issue, you can use the following commands to check the logs of the Pod:

kubectl logs -f <podname>

You can also inspect the Pod’s status, which provides detailed information about the Pod, including its current state, recent events, and any error messages.

kubectl describe pod <podname>

Additionally, you can add a debugging container to the Pod that includes a shell (e.g., BusyBox). This lets you open a shell and manually check the environment, paths, and command availability.

Example of debugging with BusyBox:

containers:
- name: test
  image: test
  command: ["/bin/sleep", "36000"]
- name: debug
  image: busybox
  command: ["/bin/sh"]
  stdin: true      # keep the shell alive so the container stays running
  tty: true        # allow an interactive session
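
Once that Pod is running, you can open a shell in the debug container and check paths, environment variables, and command availability by hand (the pod and container names follow the example above):

kubectl exec -it <podname> -c debug -- /bin/sh
# inside the container:
which <command>      # is the binary on the PATH?
echo $PATH           # which directories does the container actually search?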

If you are using a newer version of Kubernetes, you can also use Ephemeral Containers, which are temporary containers added to a running Pod. The feature was introduced as alpha in Kubernetes v1.16; on those versions you enable it by setting --feature-gates=EphemeralContainers=true on the kube-apiserver and kubelet and restarting them (in recent releases it is enabled by default).
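
On clusters where the feature is available, kubectl debug can inject such a temporary container into a running Pod; for example (pod, image, and container names are placeholders):

kubectl debug -it <podname> --image=busybox --target=<container-name> -- /bin/sh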

By carefully examining the logs and investigating in the above directions, you should be able to determine the cause of the exit code 127 issue.

How to Fix

Now that we know the common causes of exit code 127 and how to troubleshoot them, let’s see how to fix them.

  1. Command or binary not installed

If the required command or binary is missing, it may need to be installed in the container image. Modify the Dockerfile or the build process to install the necessary software.

Example:

FROM alpine:latest
RUN apk --no-cache add <package>

  2. Incorrect path or command

When specifying commands in the Pod definition, consider using the absolute path of the binary. This helps ensure that the binary is found by the runtime, regardless of the current working directory.

Example:

containers:
- name: test
  image: test
  command: ["/usr/bin/command"]

  3. Missing dependencies

The reason for the command not running may be that additional software needs to be installed in the container image. If the command requires additional setup or installation steps, you can use init containers to perform these tasks before the main container starts.

Example (installing a package using init container):

initContainers:
- name: install-package
  image: alpine:latest
  command: ["apk", "--no-cache", "add", "<package-name>"]
  volumeMounts:
  - name: shared-data
    mountPath: /data    # anything the main container needs must be placed on this shared volume

  4. Shell interpreter

If a script is specified as the command, ensure that it begins with a valid interpreter line (shebang), e.g., #!/bin/bash, and that the interpreter exists in the container.

Example:

#!/bin/sh
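
If the interpreter named in the shebang does not exist in the image (for example, bash on a minimal Alpine image), either install it or invoke the script explicitly through a shell that is present; the script path below is hypothetical:

containers:
- name: test
  image: test
  command: ["/bin/sh", "/scripts/entrypoint.sh"]   # hypothetical script path; /bin/sh exists in most base images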
  5. Volume mounts

Check the Pod’s configuration to ensure that volumes are correctly mounted. Verify that the volume names, mount paths, and subPaths are correct.

Example:

volumes:
- name: test
  emptyDir: {}
containers:
- name: test
  image: test
  volumeMounts:
  - name: test
    mountPath: /path/in/container

Additionally, confirm that the volume specified in the Pod definition exists and is accessible. If it’s a persistent volume (PV), check its status. If it’s an emptyDir or other types of volumes, verify that they are created and mounted correctly. If subPaths are used in the volume mount, ensure that the specified subPaths exist in the source directory or file.

Example:

volumeMounts:
- name: test
  mountPath: /path/in/container
  subPath: my-file.txt

Exit Code 137

In Kubernetes, the exit code 137 indicates that the process was terminated forcibly. In Unix and Linux systems, when a process is terminated due to a signal, the exit code is determined by adding the signal number to 128. Since the signal number for “SIGKILL” is 9, adding 128 to 9 results in exit code 137.

When a container exceeds its memory limit in a Kubernetes cluster, it may be terminated by the Kubernetes system with an “OOMKilled” error, indicating that the process was terminated due to insufficient memory. The exit code for this error is 137, where OOM stands for “out-of-memory”.

If the Pod state shows as “OOMKilled”, you can check it using the following command:

kubectl describe pod <podname>

OOMKiller

OOMKiller is a mechanism in the Linux kernel responsible for preventing the system from running out of memory by terminating processes that consume too much memory. When the system runs out of memory, the kernel invokes OOMKiller to select a process to terminate in order to free up memory and keep the system running.

There are two different OOM Killers in the kernel; one is the global OOM Killer, and the other is the OOM Killer based on cgroup memory control, which can be either cgroup v1 or cgroup v2.

In short, the global OOM Killer is triggered when the kernel has trouble allocating physical memory pages. When an allocation (whether for the kernel itself or for a process) initially fails, the kernel first tries various ways to reclaim and compact memory; if that succeeds or at least makes progress, it keeps retrying the allocation. If it can neither free pages nor make progress, it will usually invoke the OOM Killer.

Once the OOM Killer selects a victim, the kernel sends it a SIGKILL signal, forcibly terminating the process and releasing its memory.

Note: Pods terminated due to memory issues are not necessarily evicted from the node; if their restart policy is set to “Always”, the kubelet will restart them.

At the system level, the Linux kernel maintains an oom_score for each process running on the host: the higher the score, the more likely the process is to be killed.

The oom_score_adj value lets users adjust that score and influence which processes are killed first. Kubernetes sets the oom_score_adj value according to the Quality of Service (QoS) class of the Pod.

Kubernetes defines three types of QoS for Pods, each with a corresponding oom_score_adj value:

  • Guaranteed: -997
  • BestEffort: 1000
  • Burstable: min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

Pods with Guaranteed QoS have an oom_score_adj of -997, so they are the last to be killed when the node runs out of memory; BestEffort Pods have an oom_score_adj of 1000, so they are the first to be killed.
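
As a worked example (illustrative numbers), a Burstable Pod that requests 4 GiB of memory on a node with 16 GiB of capacity gets:

oom_score_adj = min(max(2, 1000 - (1000 * 4Gi) / 16Gi), 999)
              = min(max(2, 1000 - 250), 999)
              = 750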

To check the QoS of a Pod, you can use the following command:

kubectl get pod <podname> -o jsonpath='{.status.qosClass}'

A Pod is assigned the Guaranteed QoS class only when all of the following criteria are met (an example spec follows the list):

  • Each container in the Pod must have memory limits and memory requests.
  • For each container in the Pod, the memory limit must be equal to the memory request.
  • Each container in the Pod must have CPU limits and CPU requests.
  • For each container in the Pod, the CPU limit must be equal to the CPU request.
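
For example, a container spec like the following (illustrative values) meets all four criteria, so the Pod is classified as Guaranteed:

containers:
- name: app
  image: test
  resources:
    requests:
      memory: "256Mi"
      cpu: "500m"
    limits:
      memory: "256Mi"     # equal to the memory request
      cpu: "500m"         # equal to the CPU request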

Exit code 137 typically has two scenarios:

First and foremost, the most common cause is related to resource constraints: the container exceeds its memory limit, and Kubernetes (via the kernel’s OOM Killer) terminates it to protect the stability of the node.

The other scenario involves manual intervention: a user or a script sends a SIGKILL signal to the container process, resulting in the same exit code.

How to Troubleshoot

  1. Check Pod logs

The first step in diagnosing OOMKilled errors is to check the Pod logs for any error messages indicating memory issues. The events section of the describe command will provide further confirmation and the time/date of the error occurrence.

kubectl describe pod <podname>

State:          Running
  Started:      Fri, 12 May 2023 11:14:13 +0200
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
...

You can also look at the Pod logs stored on the node:

cat /var/log/pods/<podname>

Or simply view the container’s standard output:

kubectl logs -f <podname>

  2. Monitor memory usage

Monitor memory usage in Pods and containers using monitoring systems like Prometheus or Grafana. This helps identify which containers are consuming too much memory and triggering OOMKilled errors. You can also run dmesg on the node to see the kernel’s OOM Killer activity.
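
For example, the following grep over the kernel log will surface OOM kills (message wording varies by kernel version):

dmesg -T | grep -i -E "out of memory|oom-killer|killed process"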

  3. Use memory profilers

Use memory profilers like pprof to identify memory leaks or inefficient code that may be causing excessive memory usage.
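
For instance, if the application is a Go service that already exposes net/http/pprof (an assumption, using the conventional port 6060), you can forward the port and pull a heap profile:

kubectl port-forward <podname> 6060:6060
go tool pprof http://localhost:6060/debug/pprof/heap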

How to Fix

Below are common causes of OOMKilled Kubernetes errors and their solutions.

  1. Container memory limit reached

This may be due to an inappropriate memory limit on the container. The solution is to increase the memory limit, or to investigate and correct the root cause of the increased load. Common causes include large file uploads, which can consume significant memory (especially when multiple containers run in a single Pod), and sudden spikes in traffic. A sketch of the limit adjustment follows.
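
A minimal sketch of raising the limit (illustrative values; size them from the workload’s observed usage):

containers:
- name: app
  image: test
  resources:
    requests:
      memory: "256Mi"
    limits:
      memory: "512Mi"    # raise this if the workload genuinely needs more memory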

  2. Application memory leak drives container memory usage to its limit

Debug the application to locate the cause of the memory leak.

  3. Total memory used by all Pods exceeds available node memory

Increase the node’s available memory, migrate Pods to nodes with more memory, or adjust the memory limits of the Pods running on the node so that their total fits within the node’s capacity.

References

https://spacelift.io/blog/oomkilled-exit-code-137

https://spacelift.io/blog/exit-code-127

https://cloud.tencent.com/developer/news/1152344

https://utcc.utoronto.ca/~cks/space/blog/linux/OOMKillerWhen