Exit code 137 doesn’t mean what Docker implies it means

The container had been running cleanly for three weeks. Then it started dying every few hours. Not crashing in any visible way. The process would just stop, the status would flip to Restarting, and thirty seconds later it was back up as if nothing had happened.

The first thing you do is check the logs.

docker logs --tail 100 my-app

Nothing. No error, no stack trace, no “out of memory” message. The logs end mid-operation, as if someone cut the power. You check docker inspect for the exit code and get 137.

Exit code 137 is 128 plus 9. The 9 is SIGKILL. Something sent the hardest possible termination signal to the container’s main process. Docker reports this code and then, by default, tells you nothing else about why.
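
The arithmetic is easy to check from a shell. This is a minimal sketch, assuming a shell whose kill builtin accepts -l followed by a signal number (bash and dash both do). The function name signal_from_exit_code is made up for illustration.

```shell
# Hypothetical helper: map an exit code above 128 to the signal name behind it.
signal_from_exit_code() {
  code="$1"
  if [ "$code" -le 128 ]; then
    echo "exit code $code does not encode a signal" >&2
    return 1
  fi
  # Exit codes above 128 are 128 plus the signal number.
  kill -l "$((code - 128))"
}

signal_from_exit_code 137   # prints KILL in bash
```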

The instinct at this point is to treat it as a Docker problem. Redeploy. Check the image. Look for a bug in the application. I went through all of that. The container kept dying on the same schedule.

The log Docker doesn’t show you

The kernel keeps its own record of every process it kills. Docker doesn’t surface this in docker logs. You have to go looking for it separately.

sudo dmesg | grep -i "killed process"

On the server where I saw this, that command returned:

Out of memory: Killed process 14821 (postgres) total-vm:1048576kB, anon-rss:614400kB

Docker didn’t kill the container. The Linux kernel did. The OOM killer, a kernel mechanism that is present on every Linux distribution and has nothing to do with Docker specifically, fired because the system ran out of RAM. It picked the process it rated as the best candidate, which here was the one with the largest memory footprint, sent it SIGKILL, and moved on. Docker saw the exit code, marked the container as stopped, and since restart: always was set, brought it back up.

Which is where the actual problem starts.

What “restart: always” does to an OOM kill

The intent behind restart: always is reasonable. If a container crashes due to a bug or a transient failure, automatic restart is useful. What it does badly is respond to OOM kills.

The container restarts. It loads back into memory. The application initializes. If the memory condition that caused the kill hasn’t changed, and it hasn’t, the same pressure builds again. In thirty minutes, or an hour, or three hours depending on the workload, the OOM killer fires again. Same process, same signal, same restart. The cycle continues until either the workload drops or someone notices.

During that cycle, every restart writes logs, burns CPU, and can interrupt writes mid-operation. A PostgreSQL instance killed mid-write will run crash recovery on the next start, which takes time and consumes more memory than normal startup. The restart that was supposed to be the safety net is actively making the memory situation worse.
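
One way to notice the loop before it has run for days is to ask Docker how many times it has restarted the container. This is a rough sketch, assuming the container is called my-app as in the earlier command. The two function names and the default threshold of 3 are invented for illustration; .RestartCount is a field in docker inspect output.

```shell
# Hypothetical wrapper: how many times the restart policy has restarted a container.
restart_count() {
  docker inspect --format '{{.RestartCount}}' "$1"
}

# Warn when the count passes a threshold (the default of 3 is arbitrary).
check_restart_loop() {
  name="$1"
  threshold="${2:-3}"
  count=$(restart_count "$name") || return 2
  if [ "$count" -gt "$threshold" ]; then
    echo "$name has restarted $count times; check dmesg for OOM kills"
    return 1
  fi
  echo "$name restart count is $count"
}
```

Calling check_restart_loop my-app 3 from a cron job or a monitoring script would be one way to use it.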

Why Docker’s defaults make this invisible

Docker, by default, sets no memory limit on any container. Run docker stats on a server with multiple containers and look at the LIMIT half of the MEM USAGE / LIMIT column. On a server where limits haven’t been explicitly configured, every container shows the total host RAM as its ceiling. That means any single container can consume the entire system’s memory before anything intervenes at the container level.

The OOM killer doesn’t care about container boundaries. It looks at the full process list, calculates a badness score for each process based on memory consumption and a few other factors, and kills the one with the highest score. On a VPS running four containers with no limits set, the killer might target the database. Or the reverse proxy. Or the Docker daemon itself. It doesn’t ask Docker which container is most expendable.
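
The kernel exposes the current score for each process under /proc, so you can see who is first in line before anything gets killed. This is a small sketch, assuming a Linux host with /proc mounted. The function name top_oom_candidates is hypothetical; /proc/PID/oom_score and /proc/PID/comm are standard kernel files.

```shell
# Hypothetical helper: list the N processes with the highest OOM score.
top_oom_candidates() {
  n="${1:-5}"
  for dir in /proc/[0-9]*; do
    [ -r "$dir/oom_score" ] || continue
    # A process can exit between the check and the read, so tolerate failures.
    score=$(cat "$dir/oom_score" 2>/dev/null) || continue
    name=$(cat "$dir/comm" 2>/dev/null) || continue
    # Columns: score, PID, command name.
    printf '%s %s %s\n' "$score" "${dir#/proc/}" "$name"
  done | sort -rn | head -n "$n"
}
```

Running top_oom_candidates 5 on the host shows score, PID, and command name, highest score first. Processes inside containers appear in the same list as everything else.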

Setting a memory limit changes the behavior completely. The limit is enforced by the kernel through the container’s cgroup: when a container exceeds its configured limit, the OOM killer acts inside that cgroup only and kills a process in that container, long before host memory runs out. The rest of the stack stays alive. The blast radius shrinks from “possibly the whole server” to “this one container.”

services:
  app:
    deploy:
      resources:
        limits:
          memory: 512M

This is not a performance tuning option. It’s the difference between a container dying in isolation and a container taking down a database that has nothing to do with the problem.
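
Depending on the Compose version and how the stack is started, the deploy.resources key may not be applied, so it is worth confirming that the limit actually landed on the running container. This is a minimal sketch. The helper names are invented; .HostConfig.Memory is the docker inspect field that holds the limit in bytes, with 0 meaning no limit.

```shell
# Hypothetical helper: print the configured memory limit in bytes (0 = unlimited).
memory_limit_bytes() {
  docker inspect --format '{{.HostConfig.Memory}}' "$1"
}

# Succeeds only if the container has a non-zero memory limit.
has_memory_limit() {
  limit=$(memory_limit_bytes "$1") || return 2
  [ "$limit" -gt 0 ]
}
```

For the 512M limit above, memory_limit_bytes should report 536870912.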

What was actually consuming the RAM

In the incident above, the culprit was a build process running on the same server as the production containers. A docker compose up --build was triggered by a deployment pipeline, and the Node.js dependency installation spiked memory by several hundred megabytes for the duration of the build. That spike pushed the server over the edge. PostgreSQL, which had been sitting at stable memory usage for weeks, got killed because it happened to have the highest badness score at the moment the kernel needed to free something.

The build process finished fine. It had already released its memory by the time I started investigating. Nothing in the deployment logs flagged a problem. The only evidence of what happened was in dmesg.

This is common enough that it has a category. A build process, a document import job, or a machine learning inference container loading a model at startup can all create temporary memory spikes that kill unrelated services. The kernel doesn’t know or care that the memory spike was transient.

Swap is not a fix, but it buys time

Adding swap space doesn’t solve the root problem. If a container is consistently consuming all available RAM, swap will slow the kill down but not prevent it. Sustained heavy swap usage turns an SSD into a bottleneck and makes the whole server feel stuck.

What swap does well is handle the transient spike. A build that runs for two minutes and needs an extra gigabyte during compilation is exactly the case where swap is useful. The kernel moves some inactive pages to disk, the build finishes, the pressure drops, and normal operation resumes without anyone getting killed.

On a VPS with no swap at all, any memory spike that exceeds what the kernel can free up triggers the OOM killer almost immediately. There’s no buffer. Adding two gigabytes of swap on SSD storage costs almost nothing and changes that behavior.

sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Add the line /swapfile none swap sw 0 0 to /etc/fstab to make it persistent across reboots.
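
To confirm the swap is active, swapon --show lists it, and the kernel reports the total in /proc/meminfo. The sketch below reads that value. The function name swap_total_kb is hypothetical, and the optional file argument exists only so the parsing can be tried against a saved copy.

```shell
# Hypothetical helper: print SwapTotal in kB from a meminfo-style file.
swap_total_kb() {
  awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

With the 2G swap file above enabled, swap_total_kb should print a value close to 2097152.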

The diagnostic path, distilled

When a container dies with exit code 137, the sequence that actually tells you what happened is three commands. First, confirm Docker’s account of the kill:

docker inspect CONTAINER_NAME --format='{{.State.OOMKilled}} {{.State.ExitCode}}'

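If OOMKilled is true, the kernel’s OOM killer ended the container; the application did not crash on its own. If it’s false with exit code 137, Docker did not register an OOM event. Either something else sent SIGKILL (check your CI/CD pipeline, monitoring scripts, or anyone with server access), or a host-level OOM kill was not attributed to the container, which the kernel log will confirm or rule out.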
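
The two fields can be turned into a one-line verdict. This is a sketch only: explain_exit is a made-up helper and the message wording is mine, not Docker’s.

```shell
# Hypothetical helper: interpret the OOMKilled flag and exit code from docker inspect.
explain_exit() {
  oom_killed="$1"
  exit_code="$2"
  if [ "$oom_killed" = "true" ]; then
    echo "OOM kill recorded by Docker"
  elif [ "$exit_code" = "137" ]; then
    echo "SIGKILL without an OOM flag; check dmesg and anything that can send signals"
  else
    echo "exit code $exit_code is not a SIGKILL"
  fi
}
```

It takes the output of the inspect command as its two arguments, for example explain_exit true 137.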

Second, find the kernel record:

sudo dmesg | grep -i "killed process"

This tells you exactly which process was killed and how much RAM it had consumed. Multiple entries mean the killer has fired more than once, which means the server is under sustained memory pressure, not just a one-time spike.
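
When there are many entries, a per-process tally makes the pattern easier to read. This is a rough sketch, assuming the kernel messages look like the line quoted earlier. The function name count_oom_kills is invented for illustration.

```shell
# Hypothetical helper: count OOM kills per process name from dmesg-style input on stdin.
count_oom_kills() {
  grep -i 'killed process' |
    # Keep only the process name from "Killed process PID (name)".
    sed -E 's/.*[Kk]illed process [0-9]+ \(([^)]*)\).*/\1/' |
    sort | uniq -c | sort -rn
}
```

Feed it the kernel log with sudo dmesg | count_oom_kills. The output is one line per process name with the number of kills in front, most frequent first.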

Third, look at what’s running right now:

docker stats --no-stream

Check the LIMIT value in the MEM USAGE / LIMIT column. If every container shows the full host RAM as its limit, you haven’t set any limits. That’s the first thing to fix before anything else.

The complete guide to diagnosing and fixing Docker OOM kills on a VPS covers PostgreSQL connection limits, Java heap sizing, Redis maxmemory, and the monitoring setup that catches memory pressure before the OOM killer has to.

The thing Docker doesn’t tell you

Container abstraction is useful precisely because it hides the operating system from you. The problem is that the OOM killer lives in the operating system layer, not the Docker layer. When it fires, Docker reports the symptom (exit code 137) without reporting the cause in its logs or status output, because Docker didn’t cause it and has little visibility into why the kernel acted.

The result is that exit code 137 gets treated as a Docker problem, or an application problem, or a mystery. docker logs shows nothing. The application didn’t log anything because it was killed before it could. The only witness is dmesg, and most Docker tutorials don’t mention it.

Set memory limits on every container running in production. Not because the application is misbehaving, but because the kernel’s fallback for unbounded containers is to kill whichever process scores worst at that moment, whether or not it caused the problem. That’s not a container management strategy.
