[CPANEL-24832] Multiple service failure reports for cpsrvd, IMAP, and Exim
So this used to be a very occasional problem I thought could be attributed to memory issues, but while I've solved the memory problem and it appeared stable for a stretch, we're back to this problem again, but now more often. At least daily, sometimes once every couple days, I'll see alerts that multiple services have failed. CPSRVD is a usual suspect, as is the Bind nameserver. clamd, IMAP, and Exim sometimes follow shortly thereafter. The weird part is, all of the services are still running... it's just that occasionally, they like to throw mostly unhelpful errors.
What I believe is happening is that something all of the failing services have in common, possibly for authentication, is failing, and it's only after CPanel goes through the usual restart routines and stumbles on the thing that's failing that the issue actually self-corrects, until the next time. But until CPanel hits on the service that's failing, I'll see things like this:
dovecot: auth: Error: auth worker: Aborted PASSV request for some@email.address: Shutting down
dovecot: auth: Error: net_connect_unix(anvil-auth-penalty) failed: Permission denied
Or, from another affected service:
2019-02-19 21:38:07.244 [31472] 1gwHlt-0008BS-4Z == some@email.address R=virtual_user T=dovecot_virtual_delivery_no_batch defer (-44): LMTP error after RCPT TO:: 451 4.3.0 Temporary internal error
On the occasion I try to access a site on this server when these issues are on-going, I'll get a 503 error instead of what I'm trying to access. As said, this is all very temporary--usually resolved in about 5 minutes from when CPanel notices, but I'd be curious to know what causes it and if there's any way it can be avoided. The only thing my own digging has been able to uncover is that all of the affected services seem to rely on one common mechanism, and it's that mechanism rather than the individual services that has fallen sideways.
cPRex - Fair enough on the 10 yr cutoff. :) I have no idea of the age of a link, but it's possible that they have all been that stale...
On the p0f service issue, I haven't tried manually running upcp to see if it's repeatable, and it's intermittent. It's not been a cause of any trouble so I've back-burnered it, hoping for a "magical fix" on a future update. ;) I am saving the errors though for an eventual eval and probably a ticket.
-Pete
Hello @quanin, are you aware if you have cPhulkd enabled? If so, can you check the logs there to see if there's any further information: /usr/local/cpanel/logs/cphulkd_errors.log
Further, can you check for Out Of Memory errors by running the following: egrep -i 'oom|out of memory' /var/log/messages
... I have no idea how I missed checking that earlier. I'm also not entirely sure why it doesn't happen more often--I see CPSRVD and sometimes clamd fail daily, but it's been 3 days since the last OOM issue. Also, I thought I'd optimised the heck out of that server. I'm open to suggestions for things to tweak further.
[root@server ~]# free -ht
              total        used        free      shared  buff/cache   available
Mem:           3.7G        1.6G        291M        140M        1.9G        1.7G
Swap:          255M        255M          0B
Total:         4.0G        1.8G        291M
Hi @quanin, it could be more than just the OOM issues, especially if it's been several days. Feel free to open a ticket using the link in my signature. Once open, please reply with the Ticket ID here so that we can update this thread with the resolution once the ticket is resolved. Thanks!
Since November 2018 I am regularly getting emails from my server saying various services have failed. I just checked my inbox and there are 158 unique "service failed" emails from cpanel monitoring since 11 November 2018. It is always imap, exim, and cpsrvd. After 5-15 minutes the services will all recover but they are all unresponsive during this gap. If I ssh in and manually run /scripts/restartsrv_cpsrvd or whatever it is, immediately the problem is fixed. I have just checked /var/log/chkservd.log and it is 20.2MB; 333186 lines long. Many lines like this: [quote] cpsrvd [Service check failed to complete Unable to connect to port 2086 on 127.0.0.1: Connection refused: Died[check command:N/A][socket connect:-][socket failure threshold:3/5]]...
and also this: [quote] exim: ** [535 Incorrect authentication data != 2] : Died[check command:+][socket connect:-][socket failure threshold:4/3][fail count:2]Restarting exim....
and this: [quote] << A001 NO [UNAVAILABLE] Temporary authentication failure. [srv01.smileserv.co.uk:2019-02-20 20:49:30] imap: ** [A001 NO [UNAVAILABLE] Temporary authentication failure. [srv01.smileserv.co.uk:2019-02-20 20:49:30] != A001 OK] : Died[check command:+][socket connect:-][socket failure threshold:4/3][fail count:2]Restarting imap....
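With a chkservd.log that large, it may help to tally which services are being restarted most often before reading individual entries. The following is only a rough sketch: it assumes the log marks each restart with "Restarting <service>" as in the excerpts above, and the function name count_restarts is made up for illustration.

```shell
# Tally chkservd restarts per service.
# Reads log text on stdin and prints "<count> <service>", most frequent first.
# Assumption: each restart is logged as "Restarting <service>....".
count_restarts() {
    grep -oE 'Restarting [A-Za-z0-9_-]+' \
        | awk '{ n[$2]++ } END { for (s in n) print n[s], s }' \
        | sort -k1,1nr -k2,2
}

# Example call on a real server (path taken from the post above):
#   count_restarts < /var/log/chkservd.log
```

A service that dominates the tally is a reasonable place to start, although the restarts may only be a symptom of whatever the services share.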
I have this same issue, though some of mine may be due to the fact I need to stop being cheap and add more resources. Still, usually it's IMAP, Exim, CPSRVD, and sometimes clamd that takes a nap. Like you, they all usually recover within 5-15 minutes.
Nice to know I am not the only one. But resources are not a problem for us, we have 2x Xeon E5-2620 and 32GB RAM. The server is not being heavily loaded.
It has to be more than the OOM issues, as neither IMAP, nor Exim, nor CPSRVD were affected. Clamd was, but in the two most recent OOM kills in particular, it was HTTP or MySQL causing it, and usually only clamd and a MySQL process (sometimes Bind) are caught up in it.
Feb 19 12:22:36 quantum kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 19 12:22:36 quantum kernel: [] oom_kill_process+0x254/0x3d0
Feb 19 12:22:36 quantum kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 19 12:22:36 quantum kernel: Out of memory: Kill process 13043 (clamd) score 145 or sacrifice child
Feb 19 12:22:36 quantum kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 19 12:22:36 quantum kernel: [] oom_kill_process+0x254/0x3d0
Feb 19 12:22:36 quantum kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 19 12:22:36 quantum kernel: Out of memory: Kill process 12890 (named) score 68 or sacrifice child
Feb 22 14:11:13 quantum kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 22 14:11:13 quantum kernel: [] oom_kill_process+0x254/0x3d0
Feb 22 14:11:13 quantum kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 22 14:11:14 quantum kernel: Out of memory: Kill process 7490 (clamd) score 145 or sacrifice child
Feb 22 14:11:14 quantum kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 22 14:11:14 quantum kernel: [] oom_kill_process+0x254/0x3d0
Feb 22 14:11:14 quantum kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 22 14:11:14 quantum kernel: Out of memory: Kill process 12056 (mysqld) score 87 or sacrifice child
Exim failed last on 2/20, along with IMAP and CPSRVD, and there was no OOM event there.
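To check which processes the OOM killer has actually taken out, and so whether IMAP, Exim, or cpsrvd were ever among them, the kill lines can be tallied by process name. This is a minimal sketch that assumes the kernel writes "Out of memory: Kill process <pid> (<name>)" as in the excerpt above (other kernel versions word it differently); oom_victims is a hypothetical helper name.

```shell
# Count OOM-killer victims by process name.
# Reads syslog text on stdin and prints "<count> <process>", most frequent first.
# Assumption: kill lines look like
#   "... kernel: Out of memory: Kill process 13043 (clamd) score 145 ..."
oom_victims() {
    sed -nE 's/.*Out of memory: Kill process [0-9]+ \(([^)]+)\).*/\1/p' \
        | sort | uniq -c \
        | awk '{ print $1, $2 }' \
        | sort -k1,1nr -k2,2
}

# Example call on a real server:
#   oom_victims < /var/log/messages
```

If the mail and cpsrvd processes never show up in that list, the OOM kills and the service-failure alerts are more likely two separate problems.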
I agree with you. Please feel free to open a ticket as I suggested previously, and let us know the ticket ID here once complete.
Hi @janipewter, that's a pretty long gap of time (3 months). Are there any more recent failures, within the last week or two? If so, what are they? As much detail as possible would be helpful.
What's a pretty long gap? There have been 158 service failures since November. It's happening almost daily, at least once every three days. Multiple services failing.
Trying to troubleshoot service failures from 3 months ago is not going to be very productive: not only have the logs most likely been rotated, but the activity on the server is more than likely different than it was at that time. If you have some that are recent, i.e., in the last few weeks, please provide as much information as possible on those and we'd be happy to assist you. The advice provided here should be a good start to providing the most helpful information:
I have the exact same problem on a brand new server. I have dug into it and this is what I have found.
1) Unknown process shuts down cpsrvd
/var/log/messages
Feb 28 09:29:31 redback systemd: Stopping cPanel services...
2) chkservd notices cpsrvd is down but doesn't do anything about it. It also notices exim and dovecot are not authenticating but can't fix them (presumably due to cpsrvd being down)
/var/log/chkservd.log
[2019-02-28 09:32:44 +1100] Service check ....
cpsrvd [too soon after restart to check]...
imap [[socket_service_auth:1]TCP Transaction Log: ... << A001 NO [UNAVAILABLE] Temporary authentication failure. [redback.mudgee.host:2019-02-27 22:32:48]
exim [TCP Transaction Log: ... << 535 Incorrect authentication data
3) After 20 minutes, chkservd finally starts up cpsrvd, and exim & dovecot start working again
grep "cpsrvd \[" chkservd.log
cpsrvd [too soon after restart to check]...
cpsrvd [too soon after restart to check]...
cpsrvd [Service check failed to complete
cpsrvd [Service check failed to complete
cpsrvd [Service check failed to complete
cpsrvd [[http_service_auth:1][check command:N/A][socket connect:+]]...
For the life of me I can't find what is stopping cpsrvd. It's very regular, every 46 hours. chkservd not taking any action for 20+ minutes isn't very efficient. It notices cpsrvd is down within a few minutes but sits on the information for ages. Clobbering exim & dovecot for 20 minutes during working hours makes my users (hence me) cranky :) Manually restarting cpsrvd immediately restores exim & dovecot.
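One way to confirm a regular cadence like that is to pull every "Stopping cPanel services" line out of /var/log/messages and print the gap between consecutive events. The sketch below is only an illustration: stop_intervals is a made-up name, and because syslog timestamps carry no year it assumes all events fall within the same non-leap year.

```shell
# Print the hours elapsed between consecutive "Stopping cPanel services" events.
# Reads /var/log/messages-style text on stdin.
# Assumptions: classic syslog stamps ("Feb 28 09:29:31 host ..."), all in one
# non-leap year, since the stamps do not include the year.
stop_intervals() {
    grep 'systemd: Stopping cPanel services' | awk '
        BEGIN {
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
            split("31 28 31 30 31 30 31 31 30 31 30 31", len, " ")
            days = 0
            # offset[month] = number of days before that month starts
            for (i = 1; i <= 12; i++) { offset[name[i]] = days; days += len[i] }
        }
        {
            split($3, t, ":")
            secs = ((offset[$1] + $2 - 1) * 24 + t[1]) * 3600 + t[2] * 60 + t[3]
            if (NR > 1)
                printf "%s %s %s -> %.1f h since previous stop\n", $1, $2, $3, (secs - prev) / 3600
            prev = secs
        }'
}

# Example call on a real server:
#   stop_intervals < /var/log/messages
```

A steady interval would point toward a timer or idle timeout rather than load, which is worth mentioning in a ticket.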
Hi @MHFraser, that's definitely a mystery and it may be one that is best looked at by our analysts. Can you please open a ticket using the link in my signature? Once open, please reply with the Ticket ID here so that we can update this thread with the resolution once the ticket is resolved. Thanks!
Logged ticket 11570883 for this. After recent updates I was going to drop it as it hadn't happened, but this morning changed my mind.
After the update to version 78.0.13, cpsrvd is failing at least once every 12 hours. There is no error at all in /usr/local/cpanel/logs/error_log
Looks like it is the same problem I have. I would also like to follow this.
Hello, I just checked in on this ticket and it would appear that the services IMAP, EXIM and cPsrvd are all being flagged as being down as a result of an internal case, CPANEL-24832. The case is still in the "monitored" but critical status, which means we've not identified specifically what's occurring, though they have identified that removing these services from dormant mode appears to be a workaround for the issue at this time. I'll update this thread again if there's more.
@cPanelLauren, I can confirm this problem. I have been tracking it since November. I hadn't found it reported here... until this post! Since it is self-healing I took my time, figuring that an update would fix it. Here's what I can report:
- I first experienced it on 11/10/18
- It has occurred 25 times to date
- It occurs at random intervals, for me, ranging from 1-15 days
- It has never been more than once a day
- Time of day varies but is often about an hour or less later than the previous time (but not an identifiable pattern)
- Most of the time it is every 1-6 days, but there was one interval each of 7, 9, 13, and 15 days
- p0f reports down, then 5 minutes later reports up, then 10 minutes later cpsrvd reports down, and reports up in 5 more minutes (it is actually down 20 minutes)
- cpsrvd is always down for 20 minutes
- cpsrvd can be manually started
- p0f also reports down almost every time this occurs (a couple times exim and imap reported down instead)
Mar 6 22:26:52 host systemd: Stopping p0f passive fingerprinter...
Mar 6 22:26:52 host p0f: [!] WARNING: User-initiated shutdown.
Mar 6 22:26:52 host systemd: Stopping cPanel services...
Mar 6 22:26:52 host systemd: Stopped p0f passive fingerprinter.
Mar 6 22:26:52 host restartsrv_cpsrvd: Gracefully terminating process: systemctl with pid 11737 and owner root.
Mar 6 22:26:52 host restartsrv_cpsrvd: Gracefully terminating process: cpsrvd: with pid 11732 and owner root.
Mar 6 22:26:52 host restartsrv_cpsrvd: Gracefully terminating process: /usr/local with pid 11733 and owner root.
Mar 6 22:26:52 host restartsrv_cpsrvd: Waiting for 11737,11732,11733 to shutdown ....... terminated.
Mar 6 22:26:52 host systemd: Stopped cPanel services.
Mar 6 22:27:01 host systemd: Removed slice User Slice of root.
Mar 6 22:27:01 host systemd: Created slice User Slice of root.
Mar 6 22:27:01 host systemd: Started Session 41273 of user root.
Mar 6 22:27:19 host systemd: Starting p0f passive fingerprinter...
Mar 6 22:27:19 host p0f: --- p0f 3.09b by Michal Zalewski ---
Mar 6 22:27:19 host p0f: [+] Closed 1 file descriptor.
Mar 6 22:27:19 host p0f: [+] Loaded 322 signatures from '/usr/local/cpanel/3rdparty/etc/p0f/p0f.fp'.
Mar 6 22:27:19 host p0f: [+] Intercepting traffic on interface 'any'.
Mar 6 22:27:19 host p0f: [+] Custom filtering rule enabled: less 400 and not dst port 80 and not dst port 443 and tcp[13] & 8==0
Mar 6 22:27:19 host p0f: [+] Listening on API socket '/var/cpanel/userhomes/cpanelconnecttrack/p0f.socket' (max 20 clients).
Mar 6 22:27:19 host p0f: [+] Privileges dropped: uid 988, gid 985, root '/var/cpanel/userhomes/cpanelconnecttrack'.
Mar 6 22:27:19 host p0f: [+] Daemon process created, PID 11819 (stderr kept as-is).
Mar 6 22:27:19 host p0f: Good luck, you're on your own now!
Mar 6 22:27:19 host systemd: Started p0f passive fingerprinter.
Mar 6 22:27:20 host pure-ftpd: (?@127.0.0.1) [INFO] New connection from 127.0.0.1
Mar 6 22:27:20 host pure-ftpd: (?@127.0.0.1) [INFO] Logout.
Mar 6 22:28:01 host systemd: Started Session 41274 of user root.
Mar 6 22:29:01 host systemd: Removed slice User Slice of root.
Mar 6 22:29:01 host systemd: Created slice User Slice of root.
Mar 6 22:29:01 host systemd: Started Session 41275 of user root.
Mar 6 22:29:55 host systemd: Created slice User Slice of petez.
Mar 6 22:29:55 host systemd-logind: New session 41276 of user petez.
Mar 6 22:29:55 host systemd: Started Session 41276 of user petez.
Mar 6 22:29:55 host dbus[4955]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar 6 22:29:55 host dbus[4955]: [system] Successfully activated service 'org.freedesktop.problems'
Mar 6 22:30:01 host systemd: Started Session 41277 of user root.
Mar 6 22:30:01 host systemd: Started Session 41279 of user root.
Mar 6 22:30:01 host systemd: Started Session 41278 of user root.
Mar 6 22:30:01 host systemd: Started Session 41281 of user root.
Mar 6 22:30:01 host systemd: Started Session 41282 of user root.
Mar 6 22:30:01 host systemd: Created slice User Slice of munin.
Mar 6 22:30:01 host systemd: Started Session 41283 of user munin.
Mar 6 22:30:01 host systemd: Started Session 41280 of user root.
Mar 6 22:30:10 host systemd: Removed slice User Slice of munin.
Mar 6 22:31:01 host systemd: Started Session 41284 of user root.
Mar 6 22:32:01 host systemd: Started Session 41285 of user root.
Mar 6 22:32:26 host pure-ftpd: (?@127.0.0.1) [INFO] New connection from 127.0.0.1
Mar 6 22:32:26 host pure-ftpd: (?@127.0.0.1) [INFO] Logout.
Mar 6 22:33:01 host systemd: Started Session 41286 of user root.
Mar 6 22:34:01 host systemd: Started Session 41287 of user root.
Mar 6 22:35:01 host systemd: Started Session 41288 of user root.
Mar 6 22:35:01 host systemd: Started Session 41289 of user root.
Mar 6 22:35:01 host systemd: Started Session 41292 of user root.
Mar 6 22:35:01 host systemd: Created slice User Slice of munin.
Mar 6 22:35:01 host systemd: Started Session 41293 of user munin.
Mar 6 22:35:01 host systemd: Started Session 41291 of user root.
Mar 6 22:35:01 host systemd: Started Session 41290 of user root.
Mar 6 22:35:09 host systemd: Removed slice User Slice of munin.
Mar 6 22:36:01 host systemd: Started Session 41294 of user root.
Mar 6 22:37:01 host systemd: Started Session 41295 of user root.
Mar 6 22:37:33 host pure-ftpd: (?@127.0.0.1) [INFO] New connection from 127.0.0.1
Mar 6 22:37:33 host pure-ftpd: (?@127.0.0.1) [INFO] Logout.
Mar 6 22:38:01 host systemd: Started Session 41296 of user root.
Mar 6 22:39:01 host systemd: Started Session 41298 of user root.
Mar 6 22:39:01 host systemd: Started Session 41297 of user root.
Mar 6 22:40:01 host systemd: Started Session 41301 of user root.
Mar 6 22:40:01 host systemd: Started Session 41300 of user root.
Mar 6 22:40:01 host systemd: Created slice User Slice of munin.
Mar 6 22:40:01 host systemd: Started Session 41299 of user munin.
Mar 6 22:40:01 host systemd: Started Session 41302 of user root.
Mar 6 22:40:01 host systemd: Started Session 41304 of user root.
Mar 6 22:40:01 host systemd: Started Session 41303 of user root.
Mar 6 22:40:10 host systemd: Removed slice User Slice of munin.
Mar 6 22:41:01 host systemd: Started Session 41306 of user root.
Mar 6 22:41:01 host systemd: Started Session 41305 of user root.
Mar 6 22:42:01 host systemd: Started Session 41307 of user root.
Mar 6 22:42:39 host pure-ftpd: (?@127.0.0.1) [INFO] New connection from 127.0.0.1
Mar 6 22:42:39 host pure-ftpd: (?@127.0.0.1) [INFO] Logout.
Mar 6 22:43:01 host systemd: Started Session 41308 of user root.
Mar 6 22:44:01 host systemd: Started Session 41309 of user root.
Mar 6 22:45:01 host systemd: Started Session 41310 of user root.
Mar 6 22:45:01 host systemd: Started Session 41311 of user root.
Mar 6 22:45:01 host systemd: Created slice User Slice of munin.
Mar 6 22:45:01 host systemd: Started Session 41312 of user munin.
Mar 6 22:45:01 host systemd: Started Session 41313 of user root.
Mar 6 22:45:09 host systemd: Removed slice User Slice of munin.
Mar 6 22:46:01 host systemd: Started Session 41314 of user root.
Mar 6 22:47:01 host systemd: Started Session 41315 of user root.
Mar 6 22:47:42 host pure-ftpd: (?@127.0.0.1) [INFO] New connection from 127.0.0.1
Mar 6 22:47:42 host pure-ftpd: (?@127.0.0.1) [INFO] Logout.
Mar 6 22:47:42 host systemd: Starting cPanel services...
Mar 6 22:47:42 host systemd: Stopping p0f passive fingerprinter...
Mar 6 22:47:42 host p0f: [!] WARNING: User-initiated shutdown.
Mar 6 22:47:42 host systemd: Starting mailman services...
Mar 6 22:47:42 host systemd: Stopped p0f passive fingerprinter.
Mar 6 22:47:42 host systemd: Starting p0f passive fingerprinter...
Mar 6 22:47:42 host p0f: --- p0f 3.09b by Michal Zalewski ---
Mar 6 22:47:42 host p0f: [+] Closed 1 file descriptor.
Mar 6 22:47:42 host p0f: [+] Loaded 322 signatures from '/usr/local/cpanel/3rdparty/etc/p0f/p0f.fp'.
Mar 6 22:47:42 host p0f: [+] Intercepting traffic on interface 'any'.
Mar 6 22:47:42 host p0f: [+] Custom filtering rule enabled: less 400 and not dst port 80 and not dst port 443 and tcp[13] & 8==0
Mar 6 22:47:42 host p0f: [+] Listening on API socket '/var/cpanel/userhomes/cpanelconnecttrack/p0f.socket' (max 20 clients).
Mar 6 22:47:42 host p0f: [+] Privileges dropped: uid 988, gid 985, root '/var/cpanel/userhomes/cpanelconnecttrack'.
Mar 6 22:47:42 host p0f: [+] Daemon process created, PID 16487 (stderr kept as-is).
Mar 6 22:47:42 host p0f: Good luck, you're on your own now!
Mar 6 22:47:42 host systemd: Started p0f passive fingerprinter.
Mar 6 22:47:42 host restartsrv_mailman: (XID cpmtqy) The "mailman" service is not configured: there are no configured mailing lists
Mar 6 22:47:42 host systemd: PID file /usr/local/cpanel/3rdparty/mailman/data/master-qrunner.pid not readable (yet?) after start.
Mar 6 22:47:42 host systemd: Failed to start mailman services.
Mar 6 22:47:42 host systemd: Unit mailman.service entered failed state.
Mar 6 22:47:42 host systemd: mailman.service failed.
Mar 6 22:47:42 host restartsrv_cpsrvd: License is valid and has already updated recently.
Mar 6 22:47:42 host restartsrv_cpsrvd: Starting PID 16492: /usr/local/cpanel/libexec/cpsrvd-dormant
Mar 6 22:47:42 host systemd: Failed to read PID from file /var/run/cpsrvd.pid: Invalid argument
Mar 6 22:47:42 host systemd: cpanel.service: Supervising process 16492 which is not our child. We'll most likely not notice when it exits.
Mar 6 22:47:42 host systemd: Started cPanel services.
Mar 6 22:47:42 host systemd: Starting cPanel Greylisting Daemon...
Mar 6 22:47:42 host restartsrv_cpgreylistd: (XID ayvqp2) The "cpgreylistd" service is disabled.
Mar 6 22:47:42 host systemd: cpgreylistd.service: control process exited, code=exited status=2
Mar 6 22:47:42 host systemd: Failed to start cPanel Greylisting Daemon.
Mar 6 22:47:42 host systemd: Unit cpgreylistd.service entered failed state.
Mar 6 22:47:42 host systemd: cpgreylistd.service failed.
Mar 6 22:48:01 host systemd: Started Session 41316 of user root.
Mar 6 22:49:01 host systemd: Started Session 41317 of user root.
Mar 6 22:50:01 host systemd: Started Session 41322 of user root.
Mar 6 22:50:01 host systemd: Started Session 41321 of user root.
Mar 6 22:50:01 host systemd: Started Session 41319 of user root.
Mar 6 22:50:01 host systemd: Started Session 41323 of user root.
I have the applicable sections of logs from today saved. I hope this helps. I am not able to open a ticket at this time (I expect you'll ask) but can do so for a future occurrence, if someone else hasn't already done so. -Pete
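The 20-minute window can be measured straight from /var/log/messages by pairing each "Stopping cPanel services" line with the next "Started cPanel services" line. What follows is a minimal sketch under the assumption that the log uses the same syslog layout as the excerpt above; cpsrvd_downtime is a hypothetical name, and the time arithmetic only handles outages shorter than a day.

```shell
# Report how long cpsrvd stayed down for each stop/start pair.
# Reads /var/log/messages-style text on stdin.
# Assumption: stamps look like "Mar 6 22:26:52 host systemd: ...".
cpsrvd_downtime() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(hms,    p) { split(hms, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        /systemd: Stopping cPanel services/ { day = $1 " " $2; stop = $3; pending = 1 }
        /systemd: Started cPanel services/ && pending {
            gap = secs($3) - secs(stop)
            if (gap < 0) gap += 86400   # the outage crossed midnight
            printf "%s stopped %s started %s down %d s\n", day, stop, $3, gap
            pending = 0
        }'
}

# Example call on a real server:
#   cpsrvd_downtime < /var/log/messages
```

For the event logged above (stopped 22:26:52, started 22:47:42) this works out to 1250 seconds, a little under 21 minutes.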
Hi @PeteS, issues like this, especially intermittent ones, are easiest to troubleshoot with access while the issue is occurring. If possible, next time please do open a ticket, as we don't have one opened at this point. I'm also curious why cpsrvd is restarting those services. Is anything noted in the chkservd logs (/var/log/chkservd.log) or in the tailwatchd logs (/usr/local/cpanel/logs/tailwatchd_log)?
To be clear, the issue only has an event window of 20 minutes, after which all we have are log entries. There's no way anyone could catch the issue, open a ticket, and have a tech looking at it in that time frame. I have data for an event (on 3/6/19) from the following logs:
/var/log/chkservd.log
/var/log/messages
/usr/local/cpanel/logs/tailwatchd_log
/usr/local/cpanel/logs/error_log
I can supply them (un-obfuscated) if you give me a method to send you the file. I won't post them here. (I will be out for the weekend so if I don't reply quickly, that's why.) It would also seem prudent to look at changes in the cPanel versions I cited to see what may have affected this. I recognize the 11.76 update would have had a lot in it, but gotta start somewhere. -Pete
Ticket created, number: 11638985. Thank you very much, Michael.
Hi, I'm experiencing nightly problems on Exim/Smtp/Cpsrvd. I'm receiving mails about nightly failures and recoveries of those services, lasting about 15 minutes. Not every day, but most. I find the following in the cPanel error log, which I attach. Is there a way to properly debug/solve the problem? Thank you.
Ticket 11651935, thank you.
The event occurred again this morning. Since I have time in the office today, I have opened a ticket (#11652337) and am currently #63 (and moving backward). It might be in your best interest to escalate it, but it's up to you. I referenced this thread, provided detailed log info on my 3/6 event, and also noted the one from today. -Pete
Hi Pete, It's possible this relates to the issue noted on the following thread:
*does* seem to be the workaround. I have made that change and will report back. I suspect if no occurrences happen within 14 days that it confirms the cause. If it is the unloading/reloading then my server being low volume would have more opportunities for the issue to manifest itself, as compared to a higher volume server where it would unload less often. It also makes sense that most of the time I am seeing this at off-peak hours. I wonder if others posting here about this are in a similar situation (low volume server, and/or occurring at off-peak times). -Pete
Hello Everyone, I've merged multiple threads here so we can better track reports of this happening. Internal case CPANEL-24832 is open to track an issue where Chkservd reports service failures for the cpsrvd, IMAP, and EXIM services on a regular basis. I'll monitor this case and update this thread with more information as it becomes available. In the meantime, the temporary workaround is to disable cpsrvd in the Dormant services section under the Software tab in WHM >> Tweak Settings. Let us know if you have any questions. Thank you.
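To verify from the command line that the workaround took effect, one option is to look at the saved Tweak Settings. The sketch below rests on assumptions not stated in this thread: that the settings are stored in /var/cpanel/cpanel.config and that the dormant list is a comma-separated dormant_services= line. The helper name is_dormant is made up for illustration. The "Starting PID 16492: /usr/local/cpanel/libexec/cpsrvd-dormant" entry in the log posted earlier is the kind of line that shows dormant mode in use.

```shell
# Exit 0 if the named service appears in the dormant_services list.
# Reads cpanel.config-style text on stdin.
# Assumption: the list is stored as "dormant_services=svc1,svc2,...".
is_dormant() {
    sed -n 's/^dormant_services=//p' | tr ',' '\n' | grep -qxF -- "$1"
}

# Example call on a real server (path is an assumption):
#   if is_dormant cpsrvd < /var/cpanel/cpanel.config; then
#       echo "cpsrvd is still allowed to go dormant"
#   fi
```

If cpsrvd is still listed after changing the setting in WHM, the change may not have been saved.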
Sorry double post for some reason.
Thank you for looking into this. I have a managed Cloud VPS with Inmotion Hosting which has 4 websites (Server Version: Apache/2.4.38 (cPanel) OpenSSL/1.0.2q mod_bwlimited/1.4 mod_cpanel/1.4 Server MPM: event), and have been having this issue every few days since around the end of 2018. WHM is set to update each day, so I'm currently running 78.0.17. cphulkd, httpd, apache_php_fpm, cpanellogd, crond, exim, cpsrvd are all examples of the services I get notified are down. I also find that if I set the off-peak times for CPBackup, Backup and UPCP to my preferred times, they switch back to other different times after a couple of days. IMH Tech support thought it was a memory issue, but later confirmed I had not exceeded my allocation. I ended up with a 5 day trial on the next package up with double the memory (3GB RAM burstable as needed to 6GB) and the failure notifications seemed less frequent, so I upgraded permanently. However, every few days I still get notifications about services failing and my websites are still going down. It is so frustrating. Support have forgotten about it now. The one thing they said was that a process had taken CPHulk down, but the process ID wasn't listed anywhere to tell them what it was.
[QUOTE] With today's investigation, Chkservd (this service that sent the email, it's what makes sure other services are online) did correctly determine that cPHulk was offline, during that time period, while every other service was online. Since we only see that cPHulk failed, I checked /usr/local/cpanel/logs/cphulkd.log to try and find if this service logged why it was killed.
We found the following:
[2019-02-15 05:20:18 +0000] info [cPhulkd] DB processor shutdown via SIGTERM with pid 6181
[2019-02-15 05:20:18 +0000] info [cPhulkd] processor shutdown via SIGTERM with pid 929
[2019-02-15 05:35:06 +0000] info [cPhulkd] processor startup with pid 7152
[2019-02-15 05:35:06 +0000] info [cPhulkd] DB processor startup with pid 7593
While it is normal for cPHulk's DB processor to be started and stopped, the processor itself should be remaining online. The above logs show that a process with an ID 929 was what killed cPHulk. Unfortunately, just an hour later and no such process is running as ID 929 any longer, meaning now we can't tell what external process had issued this SIGTERM and killed cPHulk. OOM kills originate from the kernel, not as some process ID, so that rules the low memory/RAM theory out. To give us better resources to help dig deeper into these service downtime events, I've temporarily installed some advanced logging via cPanel System Snapshot, which does take process logs every few minutes, and keeps them for 24 hours. If you receive another service downtime email, reply to us again just like you did today, and hopefully with this more advanced logging, we can reach a conclusion and see a resolution.
When my sites and services go down, WHM Service Status says all the processes are running, yet when I click on Apache Status it says it is not responding. If I manually restart Apache via WHM, it comes back online immediately, so it's frustrating CPanel can't achieve the same thing.
Hello @The Old Man, Can you see if disabling cpsrvd in the Dormant services section under the Software tab in WHM >> Tweak Settings addresses the issue? This should solve the problem until case CPANEL-24832 is published. Thank you.