Value for RLimitMEM
I've been fielding major server spikes for weeks now! It makes no sense: the load will be sitting at 0.6 on one request, and the very next request spikes it to 150+ :-O
I've not been able to find a pattern for it, other than seeing httpd processes with high CPU usage and long run times; e.g.,
# ps aux --sort -pcpu | head -5
nobody   17875 18.9  3.0 4113696 257972 ?  Sl  13:56  0:39 /usr/sbin/httpd
nobody   17912 19.0  3.7 4113696 313392 ?  Sl  13:56  0:39 /usr/sbin/httpd
I'm looking at using RLimitMEM to limit how much memory a single process can eat. WHM recommends a setting of 796M; is that a generic number, or is it calculated based on something in my server?
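For context, my understanding is that the underlying Apache directive takes its value in bytes (a soft limit and an optional hard limit), so I assume WHM's 796M would land in the config looking roughly like this; the exact include file WHM writes it to is a guess on my part:

# 796 MB expressed in bytes; first value is the soft limit, second the hard limit
RLimitMEM 834666496 834666496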
-
Rather than focusing on the memory limit value, I would try to catch what's going on. When you see the spike, log in to WHM and check Apache Status, take a screenshot, and look at the number of requests. Usually these turn out to be some sort of attack.
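If WHM is too slow to load while the server is spiking, something like this from a root shell should grab the same scoreboard; the path below is the usual cPanel mod_status location, so adjust it if your setup differs:

# save a timestamped snapshot of the Apache scoreboard at spike time
curl -s http://localhost/whm-server-status > /root/apache-status-$(date +%F-%H%M%S).html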
Andrew N. - cPanel Plesk VMWare Certified Professional
Do you need immediate assistance? 20 minutes response time!*
EmergencySupport - Professional Server Management and One-time Services
-
I've been trying to track it down for several weeks! Months, really :-/
At first it looked like the issue was a ton of requests on a specific page, coming from an Amazon bot. So I used Cloudflare to rate limit that page and thought it fixed it, but no.
I've set up CF rules to block attacks and thought that fixed it, but no.
Today I noticed that I consistently had 2 httpd connections taking up a ton of CPU, both with long run times. I traced one PID back to Google Imagebot, but the request was from several hours prior! It looks like the connection was started but, for whatever reason, never ended. So I tried turning off KeepAlive to force new connections; I still had long-running connections, but the spikes were WAY worse!
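For reference, these are the httpd.conf directives I was toggling; the values below are just an illustration of tightening the keep-alive timeout rather than disabling it outright, not what WHM ships by default:

# example only -- keep persistent connections, but drop idle ones quickly
KeepAlive On
KeepAliveTimeout 3
MaxKeepAliveRequests 100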
That led me to RLimitMEM. Since I enabled that my load hasn't breached 15, which is why I thought there might be a logical way to configure it.
-
I don't think it is related to RLimitMEM, but maybe others have some different ideas as well.
Andrew N. - cPanel Plesk VMWare Certified Professional
Do you need immediate assistance? 20 minutes response time!*
EmergencySupport - Professional Server Management and One-time Services
-
It's been 24 hours, and I haven't had a server spike since enabling RLimitMEM! It's too early to say that this "fixed" anything, but before enabling this yesterday I had 8 spikes over 30, and 2 more between 10 and 30 :-O
I'm sure that this is just a band-aid, but I've had no luck tracking down the actual source so this is a big relief.
-
Spoke too soon! I just spiked big time :-( By the time I got WHM > Apache Status to load, the spike was already going down, though.
Looking at Apache Status, I copied the list of connections to Excel so that I could sort them.
If I sort by CPU, the top 3 IPs all have the same PID but belong to Cloudflare!
PID    Acc          M  CPU     SS  Req  Dur     Conn  Child  Slot    Client         Protocol  Request
9816   1/305/16930  G  152.36  0   44   403203  21.4  6.06   332.65  172.70.174.61  h2        [1/1] done
9816   1/306/16736  G  152.36  0   1    361284  21.7  5.64   332.16  172.70.42.237  h2        [1/1] done
9816   0/310/16878  G  151.37  1   193  468429  0     10.97   338.37  172.70.43.81   h2        [1/0] read: stream 0,

I tried to use lsof -p 9816, but by this time the PID had already been shut down by WHM or Apache so there were no results.
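Next time I'll try to catch one of these while the worker is still alive; a rough watcher along these lines (the 90% threshold and the output path are just guesses on my part) should dump lsof output before WHM or Apache reaps the process:

#!/bin/sh
# crude sketch: every 5 seconds, save lsof output for any httpd process above 90% CPU
while true; do
    ps -C httpd -o pid=,pcpu= | awk '$2 > 90 {print $1}' | while read pid; do
        lsof -p "$pid" > "/root/httpd-lsof-${pid}-$(date +%s).txt"
    done
    sleep 5
done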
Unsorted, I see the load jumped from 3.63 to 21.63 in a single request; those were different PIDs, but both are Amazon (Cloudflare?) IPs:
12486  0/36/16425  _  21.63  0   28   264878  0   0.72  318.63  52.70.240.171  http/1.1
12794  0/6/15879   _  3.63   7   27   317376  0   0.15  315.86  52.70.240.171  http/1.1
-
Have you checked the Raw Apache log for the time ranges when the spike was registered, or only lsof? Unfortunately, with these kinds of things it is usually a lot of manual work, and the cause is sometimes not what you expect.
For example, we suffered a CPU spike a month ago and proceeded to increase the physical cores, because we were close to Black Friday and simply could not permit user delays. Then we hit the limit once again, even though the previous spikes were below the new CPU count by a large margin. It turned out to be a case of overly strict CPU settings on a package, coming from the CloudLinux LVE/CageFS settings, that caused the spike. We relaxed the restriction on the package from 200% to 300% and the spike was gone across the entire network.
-
Have you checked the Raw Apache log for the time ranges when the spike was registered, or only lsof? Unfortunately, with these kinds of things it is usually a lot of manual work, and the cause is sometimes not what you expect.
The problem has to be with Cloudflare. I've been fielding server spikes ALL day, going from 0.68 to 150 in less than a second!
I have sys-snap.pl running, and it shows that, at the most recent spike, I had 1,163 active connections; after de-duping, I had 388 unique IPs.
Of those, 148 belong to Amazon and 235 belong to Cloudflare.
I see a few local IPs in there, too, including my own. So I don't know if those Amazon/Cloudflare IPs are a result of an attack coming through their cache, or if it's their bot.
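For anyone who wants to reproduce that per-IP breakdown without digging through sys-snap output, something along these lines gives a similar count of current connections per remote IP (IPv4 only, and netstat column layouts can vary slightly):

# count active connections per remote IP, busiest first
netstat -ntu | awk 'NR > 2 {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn | head -20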
I started blocking connections at Cloudflare when the Hostname contains Amazon or Cloudflare. That worked for an hour, then I had a sudden spike again.
In CSF I enabled CT_LIMIT, setting it to limit an IP on port 80,443 to 100 connections for 5 minutes.
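For anyone wanting to copy that setup, these are roughly the csf.conf keys involved; my exact block time is from memory, so double-check against your own config before relying on it:

# /etc/csf/csf.conf -- connection tracking (values are roughly what I used)
# maximum simultaneous connections per IP before CSF takes action
CT_LIMIT = "100"
# only track these ports
CT_PORTS = "80,443"
# how long, in seconds, to block an offending IP (5 minutes here)
CT_BLOCK_TIME = "300"
# send an email alert whenever an IP gets blocked
CT_EMAIL_ALERT = "1"

After editing, csf -r reloads the rules so the change takes effect.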
These feel like band-aids, though, not solutions.
-
I really think you need to have an admin take a direct look at the system. There are too many issues happening for there not to be some overarching problem affecting everything.
-
cPRex, my server provider has been looking at it, too, and they haven't found any reason for the spikes. But I KNOW they didn't happen until I began using Cloudflare. Coincidence? Maybe, I don't know.
Almost immediately after I enabled CT_LIMIT, the spikes stopped. But I set it to email me when an IP hits the limit and I haven't had any emails, so I honestly don't know if that fixed it or if it's just a coincidence. I also enabled CF_ENABLE at around 9pm last night, so maybe that helped? But that wouldn't explain why I was having spikes all morning and then went the 6 hours between enabling CT_LIMIT and enabling CF_ENABLE without a spike.
Since restarting Apache seems to break the spike (for at least a minute), that eliminates a lot of the possible issues. rkhunter runs regularly and hasn't found any issues, either, so I don't think there's a viral issue (and I would have probably seen that in top, anyway).
I've had this server for almost 3 years now, and haven't made any major changes other than regular updates. And there've been no major coding changes on my end other than using Cloudflare.
If the issue isn't related to Cloudflare, then the only thing left is a misconfiguration somewhere in Apache :-/
-
Do the spikes happen frequently enough that you could submit a ticket and we could catch it in the act? If so, that seems like it would be a great plan if your license is purchased through us.
-
I'm afraid not :-/ I get an email when the 5 minute average is high, and here's how many I've gotten this month:
Dec 2 => 2
Dec 3 => 6
Dec 4 => 5
Dec 5 => 4
Dec 7 => 1
Dec 8 => 5
Dec 9 => 4
Dec 10 => 3
Dec 11 => 9
Dec 12 => 1
Dec 14 => 7

That doesn't count the times that I saw it spiking and restarted Apache, though, because in those cases the 5 minute average wasn't that high.
The last spike was on Dec 14, but there's no real explanation for it. As I mentioned in the previous post, the only change that I made was to enable CT_LIMIT, but I haven't had any emails that it was used so I don't know that it actually helped.
I have sys-snap running, but all of the logs have been overwritten now, so that's no good. I do have 2 of them saved from Dec 14; do you think that would be enough for them to investigate?
-
It couldn't hurt! It's at least something, and the worst we can say is "this isn't helpful"