Linux's OOM (Out Of Memory) killer is a mechanism that the kernel employs when the system is critically low on memory. When the system runs out of memory and all other attempts to reclaim it have failed, the OOM killer sacrifices one or more processes to free up memory for the rest of the system.
This situation arises when processes on the server consume so much memory that the kernel can no longer satisfy the allocations that other, more essential processes need to keep the system operational. The kernel's answer is to invoke the OOM killer, which reviews all running processes and kills one or more of them to free up memory and keep the system running. A heuristic determines which process is the best candidate, that is, the one whose termination frees memory without damaging the basic operation of the system.
How Does it Select a Process to Kill?
The OOM Killer works by reviewing all running processes and assigning them a badness score. The process that has the highest score is the one that is killed. The OOM Killer assigns a badness score based on a number of criteria, the principal of which are as follows:
The process and all of its child processes are using excessive memory (asking for more memory than is available on the system and/or refusing to release unused memory for reclamation).
It is always preferable to kill the minimum number of processes (ideally one).
Root and important system processes are given much lower badness scores by default.
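On a running system, the kernel exposes each process's current score in /proc/<pid>/oom_score, so the ranking described above can be inspected directly. The following is a minimal sketch rather than a definitive tool; the function name top_oom_scores and its two optional arguments are made up for this illustration.

```shell
#!/bin/sh
# top_oom_scores [PROC_ROOT] [COUNT]
# Print "score<TAB>pid<TAB>name" for the COUNT processes with the highest
# OOM score, highest first. PROC_ROOT defaults to /proc and COUNT to 10;
# both parameters are hypothetical conveniences for this sketch.
top_oom_scores() {
    proc_root=${1:-/proc}
    count=${2:-10}
    for dir in "$proc_root"/[0-9]*; do
        # Skip entries that are unreadable or that vanished while looping.
        [ -r "$dir/oom_score" ] || continue
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name='?'
        printf '%s\t%s\t%s\n' "$score" "${dir##*/}" "$name"
    done | sort -rn | head -n "$count"
}

# Show the ten most likely OOM victims on this machine.
top_oom_scores /proc 10
```

The process at the top of this list is the one the kernel would currently consider the most likely candidate.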
You can refer to this link for a short discussion on how this mechanism is implemented:
How do I Determine if OOM Killer is Responsible for a Process Being Killed?
You can run the following command to see whether a service or process has been killed by the OOM killer. MySQL is used here for demonstration purposes; replace mysql with the name of your service or process:
grep -Ei "(out of memory|oom)" /var/log/messages* -A 1 | grep -i mysql
Note: Different systems use different log formats for processes killed by the OOM killer, and it is not practical to list every possible format here. The most reliable approach is to inspect the system logs directly (/var/log/messages or journalctl), search for the strings "oom" or "out of memory", and compare the timestamps with the time at which the affected service was reported to fail.
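On systems that keep kernel messages in the systemd journal, the same search can be sketched as a small filter that prints the PID and name of each killed process. This is an illustration under assumptions: the helper name oom_victims is hypothetical, and the pattern only covers the "Out of memory: Kill process" and "Out of memory: Killed process" wordings, so it may need adjusting for your log format.

```shell
#!/bin/sh
# oom_victims: read kernel log text on stdin and print "PID NAME" for every
# line that reports an OOM kill. Only the "Kill process" and "Killed process"
# wordings are matched (an assumption; other systems may log differently).
oom_victims() {
    sed -nE 's/.*[Oo]ut of memory: Kill(ed)? process ([0-9]+) \(([^)]*)\).*/\2 \3/p'
}

# Scan the kernel messages kept by the systemd journal, if it is available.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k --no-pager 2>/dev/null | oom_victims
fi
```

The filter reads standard input, so it can also be fed from a log file, for example with oom_victims < /var/log/messages.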
What Is The Solution?
There are a few options to consider. The first is to tune the OOM killer itself. This is generally the more technically demanding route; it might not resolve the issue and, if not executed properly, can even make the situation worse. This link can help, should you decide to go with that option:
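As a minimal sketch of what such tuning can look like, the kernel lets you bias the badness score of a single process through /proc/<pid>/oom_score_adj, which accepts values from -1000 (never kill) to 1000 (kill first). The helper name set_oom_score_adj and its optional third argument are hypothetical and exist only for this illustration; lowering a value normally requires root.

```shell
#!/bin/sh
# set_oom_score_adj PID VALUE [PROC_ROOT]
# Write VALUE (-1000..1000) to PROC_ROOT/PID/oom_score_adj.
# PROC_ROOT defaults to /proc; it is a hypothetical parameter that makes
# the sketch easy to try against a scratch directory first.
set_oom_score_adj() {
    pid=$1 value=$2 proc_root=${3:-/proc}
    # Reject anything that is not a plain, optionally negative, integer.
    case $value in
        ''|-|*[!0-9-]*|?*-*)
            echo "value must be an integer" >&2
            return 2 ;;
    esac
    if [ "$value" -lt -1000 ] || [ "$value" -gt 1000 ]; then
        echo "value must be between -1000 and 1000" >&2
        return 2
    fi
    printf '%s\n' "$value" > "$proc_root/$pid/oom_score_adj"
}

# Example (run as root): make PID 1234 a much less likely victim.
# set_oom_score_adj 1234 -500
```

The setting applies only to the running process and is lost when it exits, so a persistent change has to be reapplied by whatever starts the service.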
Another option is to identify and then limit the processes on your system that use excessive memory (that is, those with a higher badness score) and are not essential to the operation of the server. Limiting these processes manually reduces the chance that the OOM killer is invoked in the first place and so protects the system from its unexpected consequences. For demonstration purposes, please refer to this link for an example of general performance tuning practices for a typically memory-intensive service, namely PHP-FPM as used with Apache:
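The identification step can be sketched with ps alone. The snippet below ranks processes by resident memory (RSS); the helper name top_rss is hypothetical, and RSS is used here only as a rough stand-in for what the kernel's own scoring takes into account.

```shell
#!/bin/sh
# top_rss [COUNT]
# Read "PID RSS_KiB COMMAND" lines on stdin and print the COUNT largest
# memory consumers (default 10), with RSS converted to MiB.
top_rss() {
    sort -k2,2nr | head -n "${1:-10}" |
        awk '{
            pid = $1; rss = $2
            # Blank the first two fields so the rest of the line is the command.
            $1 = ""; $2 = ""; sub(/^ +/, "")
            printf "%-8s %10.1f MiB  %s\n", pid, rss / 1024, $0
        }'
}

# The trailing "=" signs suppress the ps header line.
ps -eo pid=,rss=,comm= | top_rss 10
```

Processes that appear near the top and are not essential to the server are the natural candidates for limiting.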
Yet another option, possibly the easiest, is to increase the memory on your system. You can see the current memory situation by running a command such as free -h -t, which prints output in the following form:
              total        used        free      shared  buff/cache   available
Mem:            31G         18G        8.7G        8.7M        4.0G         12G
Swap:          4.9G         53M        4.9G
Total:          36G         18G         13G
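For monitoring, the "available" figure is usually the most useful one, and it can be read from /proc/meminfo as a share of total memory. The following is a minimal sketch; the function name mem_available_pct and its optional file argument are hypothetical, and it assumes a kernel that reports the MemAvailable field.

```shell
#!/bin/sh
# mem_available_pct [MEMINFO_FILE]
# Print MemAvailable as a percentage of MemTotal, with one decimal place.
# MEMINFO_FILE defaults to /proc/meminfo (the argument is a hypothetical
# convenience so the function can be tried on a saved copy).
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total <= 0) exit 1
            printf "%.1f\n", 100 * avail / total
        }
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "Memory available: $(mem_available_pct)%"
fi
```

For the sample output above, the same calculation is roughly 12G out of 31G, or about 39% available.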
The following references are useful for further understanding and troubleshooting the Linux OOM Killer:
Out of memory: Kill process or sacrifice child
How to Configure the Linux Out-of-Memory Killer
Capacity Tuning (overcommit_memory, Out-of-Memory Kill Tunables)
Out Of Memory Management
Overcommit Memory in SUSE Linux Enterprise Server