Tracking down source of load spike
I got an email alert that my server load was 75+ :-O By the time I got to my PC to see what was up, though, the load was already back to normal.
The email includes an apachestatus.html attachment, in which I see:
0-0 | 11008 | 0/166/33797 | R | 75.74 | 0 | 2453 | 547158 | 0.0 | 1.97 | 409.81 | 172.70.134.134 | h2 | example.com:443 | [1/0] read: stream 0, |
1-0 | 13430 | 0/4/37803 | _ | 2.10 | 0 | 243 | 670625 | 0.0 | 0.01 | 456.70 | 123.45.67.89 | http/1.1 | example.com:443 | GET /legit_page HT |
Above that were 24 more connections, all with what appears to be a load of 75+.
Am I right that the connection to /legit_page had a normal load of 2.10, but the Cloudflare connection (172.70.134.134) stuck in "read: stream" suddenly caused it to spike to 75.74?
If so, any suggestions on tracking down what could have caused it to spike?
-
It is very difficult to tell after the fact. You can try looking in /var/cpanel/dcpumon to see if there are any 'top' snapshots from that period.
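In case the directory does exist on your install, something like this would list the most recent snapshots (the path is the cPanel default and an assumption on my part; it may simply not be present, as you found):

```shell
#!/bin/sh
# List recent dcpumon 'top' snapshots, newest first.
# /var/cpanel/dcpumon is the cPanel default path (an assumption here);
# the directory may not exist on every server.
SNAPDIR=/var/cpanel/dcpumon
if [ -d "$SNAPDIR" ]; then
  ls -lt "$SNAPDIR" | head -20
else
  echo "no dcpumon snapshots at $SNAPDIR"
fi
```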
-
Unfortunately, I don't see a dcpumon... neither a file nor a directory :-/
I installed sys-snap.pl, though, and that has a lot of data! I'm sure it would be helpful if I knew what to look for! LOL
Just a few minutes ago my load was at about 70, so I restarted Apache and it went back down. I looked at the log from sys-snap at that time, and I have this:
20231204 13 44 Load Average: 62.25 35.52 15.91 1/641 7680
Memory Usage:
MemTotal: 8388608 kB
MemFree: 1268296 kB
MemAvailable: 5930512 kB
Cached: 3041276 kB
Buffers: 0 kB
Active: 3702112 kB
Inactive: 1576520 kB
Active(anon): 1453272 kB
Inactive(anon): 903460 kB
Active(file): 2248840 kB
Inactive(file): 673060 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2097152 kB
SwapFree: 1727392 kB
Dirty: 624 kB
Writeback: 0 kB
AnonPages: 2356732 kB
Shmem: 119376 kB
Slab: 1773032 kB
SReclaimable: 1740316 kB
SUnreclaim: 32716 kB
Virtual Memory Stats:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 369760 1267708 0 4781592 0 0 21 14 0 1 4 1 96 0 0
3 0 369760 1255240 0 4781540 0 0 4 1588 0 1299 3 0 97 0 0
3 0 369760 1249200 0 4781492 0 0 0 1052 0 1136 2 0 98 0 0
0 0 369760 1242272 0 4781428 0 0 8 1592 0 1040 1 0 99 0 0
1 0 369760 1235608 0 4781372 0 0 44 2916 0 1130 1 0 99 0 0
2 0 369760 1226260 0 4781348 0 0 20 1832 0 1566 1 0 99 0 0
1 0 369760 1220480 0 4781280 0 0 4 1308 0 883 1 0 99 0 0
0 0 369760 1215608 0 4781260 0 0 8 1328 0 1073 2 0 98 0 0
0 0 369760 1193308 0 4781236 0 0 16 1352 0 1396 1 0 98 0 0
0 0 369760 1202988 0 4781248 0 0 12 532 0 1260 1 0 99 0 0
Under "Process list", the highest was /usr/sbin/mysqld at 13.9% of CPU and 16.4% of memory (fairly normal). Then I had a few httpd processes that were a little high:
nobody 7326 8.7 0.8 2999384 74072 ? Sl 13:44:20 00:00:00 \_ /usr/sbin/httpd -k start
nobody 7340 3.5 0.7 1817680 66856 ? Sl 13:44:20 00:00:00 \_ /usr/sbin/httpd -k start
nobody 7353 2.9 0.7 2278488 60524 ? Sl 13:44:20 00:00:00 \_ /usr/sbin/httpd -k start
nobody 7356 9.0 0.8 2933848 71864 ? Sl 13:44:20 00:00:01 \_ /usr/sbin/httpd -k start
nobody 7393 6.8 0.8 3261528 67744 ? Sl 13:44:20 00:00:00 \_ /usr/sbin/httpd -k start
nobody 7618 9.8 0.8 2999384 74424 ? Sl 13:44:21 00:00:00 \_ /usr/sbin/httpd -k start
Just before those heavy httpd processes, though, I had this:
root 7321 0.0 0.0 12744 2740 ? S 13:44:19 00:00:00 \_ /usr/local/cpanel/bin/splitlogs --dir=/etc/apache2/logs/domlogs --main=my.server.com --suffix=-bytes_log
root 7322 0.0 0.0 12744 2584 ? S 13:44:19 00:00:00 \_ /usr/local/cpanel/bin/splitlogs --dir=/etc/apache2/logs/domlogs --main=my.server.com --mainout=/etc/apache2/logs/access_log
nobody 7323 0.0 0.3 272936 26464 ? S 13:44:19 00:00:00 \_ /usr/sbin/httpd -k start
root 7324 2.6 0.1 53268 13864 ? S 13:44:19 00:00:00 \_ /usr/local/cpanel/3rdparty/bin/perl /usr/local/cpanel/bin/leechprotect
Could splitlogs or leechprotect be the source of the problem?
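For reference, here's how I've been pulling the load lines out of the sys-snap data to find the spike intervals. A minimal sketch using the line format shown above; the printf sample stands in for the real log files, whose path depends on where sys-snap.pl was set up:

```shell
#!/bin/sh
# Filter sys-snap "Load Average" lines for intervals above a threshold.
# The printf lines are sample data in the format shown above; against a
# real install you would grep the sys-snap log directory instead.
THRESHOLD=50
printf '%s\n' \
  '20231204 13 44 Load Average: 62.25 35.52 15.91 1/641 7680' \
  '20231204 13 45 Load Average: 3.10 12.00 14.00 1/600 7700' |
  awk -v t="$THRESHOLD" '$6+0 > t { print $1, $2":"$3, "load", $6 }'
# prints: 20231204 13:44 load 62.25
```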
-
It's incredibly unlikely for either of those to be the source of the issue.
sys-snap only works if it is installed during the time the issue happens - you can't install it after the problem has already happened and expect to get useful data.
I find the sar logs in /var/log/sa to be the most helpful when trying to track down load issues after they have happened.
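For reference, the sysstat files in /var/log/sa are named saDD by day of month, and `sar -q` replays the run-queue length and load-average history. A sketch (the day number is only an example):

```shell
#!/bin/sh
# Replay load-average history for a given day from the sysstat archive.
# sa files are named by day of month; DAY=04 is just an example.
DAY=04
F=/var/log/sa/sa$DAY
if command -v sar >/dev/null 2>&1 && [ -f "$F" ]; then
  sar -q -f "$F"
else
  echo "no sar data for day $DAY on this machine"
fi
```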
-
sys-snap only works if it is installed during the time the issue happens - you can't install it after the problem has already happened and expect to get useful data.
Yeah, but that's not an issue in this case. I'm getting these spikes about 6-8 times a day! In fact, just a few minutes after I made the previous post, I had yet another spike. And again, splitlogs was running at the exact same time.
Interesting to note that access_log finally rotated at 7:49pm, and I haven't had a load spike since. But that also means that it took at least 6 hours for the 10MB log to rotate.
I remember having some sort of problem with log rotation several years ago, and the recommendation then was to lower the rotation threshold so that rotation wouldn't take so long to complete. I've had it set to 10MB ever since, and there's no practical reason it should take more than a few seconds to rotate a 10MB file.
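To check whether any logs are actually sitting past that threshold, I can use something like this (the domlogs path is cPanel's default and an assumption on my part):

```shell
#!/bin/sh
# List Apache domlogs larger than the 10 MB rotation threshold.
# /etc/apache2/logs/domlogs is cPanel's default domlog location.
DOMLOGS=/etc/apache2/logs/domlogs
if [ -d "$DOMLOGS" ]; then
  find "$DOMLOGS" -type f -size +10M -exec ls -lh {} +
else
  echo "no domlogs directory at $DOMLOGS"
fi
```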
-
I agree, but if the I/O was THAT high, it could take a while. I would reach out to the host to have them check the hardware on the machine - since there haven't been any changes to the server, failing hardware could be an issue.
-
FWIW, there don't appear to be any hardware issues. And I think I've ruled out splitlogs and leechprotect as the cause.
I started another thread on this, so this is mainly posted for posterity. Best I can tell it's "fixed" by enabling RLimitMEM. I think that's more of a Band-aid than a fix, but without being able to find the source of the problem this is the closest fix I've found.
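For anyone landing here via search: RLimitMEM takes a soft and an optional hard limit in bytes, and it restricts processes forked from Apache children (CGI scripts and the like) rather than the httpd workers themselves. A sketch of the directive, with 512 MB picked purely as an example value:

```apache
# httpd.conf (or an include file) - cap address space for processes
# forked by Apache children; 512 MB soft and hard, chosen as an example.
RLimitMEM 536870912 536870912
```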