How to troubleshoot server getting overloaded after the event
Today my server apparently became badly overloaded for a while, and I would like to troubleshoot after the fact to see exactly what caused the problem. Shortly after 2pm all websites stopped responding. I could initially reach the WHM login page, but once I entered my credentials the page started to time out. I then opened my VPS's console and couldn't log in there either: I managed to type in root, but nothing happened after I typed the password.
On the VPS panel I could see that the CPU usage was up around 85% and staying there. After about ten minutes I finally hit the restart button and a minute or so later everything came back to life.
This seemed like a DDoS attack, so once I got back into WHM I went to the CloudLinux page to check the stats. It looks as though no one was hitting any resource limits during the outage, so I'm wondering whether it was something not related to web traffic. I don't know enough to know where to start looking...
I would like to know how best to troubleshoot this outage after the fact.
I was looking at the cPanel articles and tried the sar command. This is what it shows from when the attack was taking place:
14:10:01 14 724 1.90 2.09 2.15 0
14:20:01 15 733 2.58 2.16 2.10 0
14:30:01 15 763 2.25 2.19 2.13 0
14:40:01 5 707 2.24 2.52 2.36 0
14:40:01 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
14:50:01 12 733 1.99 2.26 2.31 0
15:00:01 16 783 2.43 2.23 2.22 0
15:21:40 17 1078 10.46 11.30 8.63 37
Average: 5 717 2.35 2.39 2.35 1
15:43:32 LINUX RESTART (32 CPU)
15:50:01 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
16:00:01 18 712 0.50 0.71 0.56 0
16:10:01 13 728 0.81 0.79 0.65 0
16:20:01 13 700 0.72 0.88 0.77 0
16:30:01 1 647 0.72 0.79 0.79 1
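For anyone wanting to scan a longer sar history for samples like the 15:21:40 one, here is a minimal Python sketch. It assumes the column layout shown above (runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15, blocked) and 24-hour timestamps. The function names and the thresholds are hypothetical, chosen only for illustration. In sar's queue report the blocked column counts tasks waiting on I/O, so a jump from 0 to 37 may be worth as much attention as the load average itself.

```python
import re

# Sample lines copied from the sar output above. In practice, read the
# saved output of the sar command from a file instead.
SAR_LINES = """\
14:50:01 12 733 1.99 2.26 2.31 0
15:00:01 16 783 2.43 2.23 2.22 0
15:21:40 17 1078 10.46 11.30 8.63 37
Average: 5 717 2.35 2.39 2.35 1
15:50:01 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
16:00:01 18 712 0.50 0.71 0.56 0
""".splitlines()

# HH:MM:SS runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
ROW = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)$"
)


def parse_rows(lines):
    """Return one dict per data row; headers, averages and restarts are skipped."""
    rows = []
    for line in lines:
        m = ROW.match(line.strip())
        if not m:
            continue
        rows.append({
            "time": m.group(1),
            "runq_sz": int(m.group(2)),
            "plist_sz": int(m.group(3)),
            "ldavg_1": float(m.group(4)),
            "ldavg_5": float(m.group(5)),
            "ldavg_15": float(m.group(6)),
            "blocked": int(m.group(7)),
        })
    return rows


def flag_spikes(rows, load_limit=8.0, blocked_limit=10):
    """Pick rows whose 1-minute load or blocked count exceeds the (assumed) limits."""
    return [r for r in rows
            if r["ldavg_1"] > load_limit or r["blocked"] > blocked_limit]


if __name__ == "__main__":
    for row in flag_spikes(parse_rows(SAR_LINES)):
        print(row["time"], row["ldavg_1"], row["blocked"])
```

The limits are arbitrary starting points. On a 32 CPU machine a load of 10 is not necessarily CPU saturation by itself, so it may make sense to tune them against a normal day's figures.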
Some more info I have found: when the attack started, the system log shows hundreds of firewall messages within milliseconds of each other, finally ending like this:
May 22 15:21:40 mmmm kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=aa:9f:71:4f:17:f7:5c:5e:ab:43:85:f0:08:00 SRC=196.251.116.138 DST=45.58.xx LEN=40 TOS=0x00 PREC=0x00 TTL=243 ID=54321 PROTO=TCP SPT=60282 DPT=81 WINDOW=65535 RES=0x00 SYN URGP=0
May 22 15:21:40 mmmm kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=aa:9f:71:4f:17:f7:5c:5e:ab:43:85:f0:08:00 SRC=147.185.133.23 DST=45.58.xx LEN=44 TOS=0x00 PREC=0x00 TTL=246 ID=54321 PROTO=TCP SPT=50932 DPT=8863 WINDOW=65535 RES=0x00 SYN URGP=0
May 22 15:21:40 mmmm kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=aa:9f:71:4f:17:f7:5c:5e:ab:43:85:f0:08:00 SRC=167.94.146.16 DST=45.58.xx LEN=60 TOS=0x00 PREC=0x00 TTL=59 ID=52928 PROTO=TCP SPT=56189 DPT=104 WINDOW=42340 RES=0x00 SYN URGP=0
May 22 15:21:40 mmmm pdns_server[477217]: Limit of simultaneous TCP connections reached - raise max-tcp-connections
... repeated 11 times ...
---- then a bunch of these ----
May 22 15:21:40 mmmm p0f[477644]: [!] WARNING: Too many host entries, deleting 1001. Use -m to adjust.
-- and this----
May 22 15:21:41 mmmm systemd[1]: Finished User Runtime Directory /run/user/1056.
May 22 15:21:41 mmmm systemd-coredump[4019844]: Process 477392 (dovecot) of user 0 dumped core.#012#012Stack trace of thread 477392:#012#0 0x00007f4160a8b53c __pthread_kill_implementation (libc.so.6 + 0x8b53c)#012#1 0x00007f4160a3e686 raise (libc.so.6 + 0x3e686)#012#2 0x00007f4160a28833 abort (libc.so.6 + 0x28833)#012#3 0x00007f4160cf9fd1 fatal_handler_real.cold (libdovecot.so.0 + 0x5cfd1)#012#4 0x00007f4160da6717 i_syslog_fatal_handler (libdovecot.so.0 + 0x109717)#012#5 0x000055ae1d6ef50b master_fatal_callback (dovecot + 0x950b)#012#6 0x00007f4160cf960e i_panic (libdovecot.so.0 + 0x5c60e)#012#7 0x000055ae1d6ee361 service_status_more.cold (dovecot + 0x8361)#012#8 0x000055ae1d6f68d8 service_process_idle_kill_timeout (dovecot + 0x108d8)#012#9 0x00007f4160dc0d7b io_loop_handle_timeouts (libdovecot.so.0 + 0x123d7b)#012#10 0x00007f4160dc2e20 io_loop_handler_run_internal (libdovecot.so.0 + 0x125e20)#012#11 0x00007f4160dc2f44 io_loop_handler_run (libdovecot.so.0 + 0x125f44)#012#12 0x00007f4160dc3100 io_loop_run (libdovecot.so.0 + 0x126100)#012#13 0x00007f4160d2f1a7 master_service_run (libdovecot.so.0 + 0x921a7)#012#14 0x000055ae1d6eef2b main (dovecot + 0x8f2b)#012#15 0x00007f4160a295d0 __libc_start_call_main (libc.so.6 + 0x295d0)#012#16 0x00007f4160a29680 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29680)#012#17 0x000055ae1d6ef3b5 _start (dovecot + 0x93b5)#012ELF object binary architecture: AMD x86-64
May 22 15:21:41 mmmm systemd[1]: Starting User Manager for UID 0...
May 22 15:21:41 mmmm systemd[1]: Starting User Manager for UID 1021...
May 22 15:21:41 mmmm systemd[1]: Starting User Manager for UID 1056...
And this kind of thing continues until, a few seconds later, I start getting the "Too many host entries" warnings. This all continues until I restart the server, and then everything goes back to normal.
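To get a feel for whether the blocked packets come from a few sources or many, the firewall lines can be tallied by source address and destination port. This is a rough Python sketch, not a definitive tool: the sample lines are abridged copies of the entries above (MAC and several other fields removed), and the function name is hypothetical. On a real server you would feed it lines from the system log file, whose path depends on the setup.

```python
import re
from collections import Counter

# Abridged copies of the log lines above, for illustration only.
SAMPLE_LOG = [
    "May 22 15:21:40 mmmm kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= "
    "SRC=196.251.116.138 DST=45.58.xx LEN=40 PROTO=TCP SPT=60282 DPT=81 SYN URGP=0",
    "May 22 15:21:40 mmmm kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= "
    "SRC=147.185.133.23 DST=45.58.xx LEN=44 PROTO=TCP SPT=50932 DPT=8863 SYN URGP=0",
    "May 22 15:21:40 mmmm kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= "
    "SRC=167.94.146.16 DST=45.58.xx LEN=60 PROTO=TCP SPT=56189 DPT=104 SYN URGP=0",
    "May 22 15:21:40 mmmm pdns_server[477217]: Limit of simultaneous TCP "
    "connections reached - raise max-tcp-connections",
]

# Pull the KEY=value pairs we care about out of a firewall log line.
FIELD = re.compile(r"\b(SRC|DPT)=(\S+)")


def tally_blocks(lines):
    """Count blocked packets per source address and per destination port."""
    sources, ports = Counter(), Counter()
    for line in lines:
        if "Firewall:" not in line:
            continue  # ignore non-firewall entries such as pdns_server
        fields = dict(FIELD.findall(line))
        if "SRC" in fields:
            sources[fields["SRC"]] += 1
        if "DPT" in fields:
            ports[fields["DPT"]] += 1
    return sources, ports


if __name__ == "__main__":
    sources, ports = tally_blocks(SAMPLE_LOG)
    print("Top sources:", sources.most_common(5))
    print("Top ports:", ports.most_common(5))
```

As a rule of thumb only: counts spread thinly across many sources and unrelated ports look different from one address or one port dominating, and the tally at least shows which of the two you are dealing with.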
My server is a VPS running CloudLinux with WHM/cPanel.
-
Hey there! It's always much harder trying to get details after the issue has resolved itself than during the actual problem. There are some logs on the server, such as the sar logs, that may be able to provide you with helpful data about the problem or at least point you in the right direction.
I'd start with our list of log data and troubleshooting steps here:
https://support.cpanel.net/hc/en-us/articles/360056001894-How-to-diagnose-high-server-loads
If I could be more specific I would, but you'll have to poke around until you find a specific process or tool or user that sticks out.
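One way to do that poking around is to count log messages per program per minute and look for whatever suddenly gets noisy. The Python sketch below assumes the traditional syslog line format seen in the excerpts above ("May 22 15:21:40 host program[pid]: message"); the function name is hypothetical, and the sample lines are shortened copies from this thread.

```python
import re
from collections import Counter

# Captures "Mon DD HH:MM" (minute resolution) and the program name.
SYSLOG = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+([^\s:\[]+)")

# Shortened copies of log lines from this thread, for illustration only.
SAMPLE_LINES = [
    "May 22 15:21:40 mmmm kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT=",
    "May 22 15:21:40 mmmm pdns_server[477217]: Limit of simultaneous TCP "
    "connections reached - raise max-tcp-connections",
    "May 22 15:21:40 mmmm p0f[477644]: [!] WARNING: Too many host entries, "
    "deleting 1001. Use -m to adjust.",
    "May 22 15:21:41 mmmm systemd[1]: Starting User Manager for UID 0...",
    "May 22 15:21:41 mmmm systemd[1]: Starting User Manager for UID 1021...",
]


def messages_per_minute(lines):
    """Count log lines per (minute, program) pair; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = SYSLOG.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (minute, program), n in messages_per_minute(SAMPLE_LINES).most_common(10):
        print(minute, program, n)
```

Comparing the busiest minute against a quiet one from earlier in the day should make any program that stands out easier to spot.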