Symptoms
Sometimes a service will report as down, despite the service being up and connectable through manual checks.
Description
When a service check fails, cPanel's monitoring software, chkservd, alerts you if notifications are configured through the Contact Manager. This happens when the chkservd daemon is unable to connect to a local service such as MySQL, Apache, or Exim.
Workaround
The information below contains a few examples of what can be checked to confirm that the service is configured correctly.
How process checks are performed
In /etc/chkserv.d/, there are files for each service. As an example, here is the one for clamd:
service[clamd]=x,x,x, /scripts/restartsrv_clamd ,clamd,root
The syntax for these files is as follows:
service[SERVICE]=PORT, SEND, RESPONSE, RE-START COMMAND,PROCESS NAME, PROCESS USER
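To make the field order easier to see, here is a minimal sketch that splits one of these definitions into labeled fields. The function name parse_chkservd_line is hypothetical and not part of cPanel; the labels simply follow the syntax line above, and any fields beyond the sixth stay attached to the last label.

```shell
#!/bin/sh
# Print the fields of one chkservd service definition, one per line.
# parse_chkservd_line is a hypothetical helper, not a cPanel tool.
parse_chkservd_line() (
    line=$1
    name=${line#service\[}      # drop the leading "service["
    name=${name%%\]*}           # keep the text before the first "]"
    value=${line#*=}            # everything after the first "="
    # Split on commas; the last variable keeps any remaining fields.
    IFS=, read -r port send response restart process user <<EOF
$value
EOF
    printf 'service=%s\n' "$name"
    for field in port send response restart process user; do
        eval "raw=\$$field"
        # Trim surrounding whitespace from each field.
        trimmed=$(printf '%s' "$raw" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        printf '%s=%s\n' "$field" "$trimmed"
    done
)
```

For example, parse_chkservd_line "$(cat /etc/chkserv.d/clamd)" would list the port, send, response, restart, process, and user values for clamd.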
An example of the port/response check can be seen with /etc/chkserv.d/pop:
service[pop]=110,QUIT,.OK, ,dovecot||courier&&authdaemond,root,,imap/usr/local/cpanel/scripts/restartsrv_imap
At least for services that look like the above, chkservd runs /scripts/restartsrv_$SERVICE --check. If that command produces any output, chkservd assumes that the service has failed and restarts it using /scripts/restartsrv_$SERVICE --restart.
Permissions
The permissions on the subdirectories in /var/cpanel/serviceauth/ need to be 700 (711 on the serviceauth directory itself), with the exception of the /var/cpanel/serviceauth/exim directory, which should be 750.
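As a rough way to audit these modes, the sketch below compares each directory against the values stated above. The helper name check_serviceauth_perms is hypothetical, and the script assumes GNU stat (the -c option), which is what Linux servers normally provide.

```shell
#!/bin/sh
# Report serviceauth directories whose mode differs from the expected values.
# check_serviceauth_perms is a hypothetical helper; the base directory is an
# argument so the check can also be pointed at a test copy.
check_serviceauth_perms() (
    base=${1:-/var/cpanel/serviceauth}
    status=0
    top=$(stat -c '%a' "$base") || exit 2
    if [ "$top" != 711 ]; then
        echo "$base: mode $top, expected 711"
        status=1
    fi
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue
        dir=${dir%/}
        # exim is the one subdirectory that should be 750 instead of 700.
        case ${dir##*/} in
            exim) want=750 ;;
            *)    want=700 ;;
        esac
        have=$(stat -c '%a' "$dir")
        if [ "$have" != "$want" ]; then
            echo "$dir: mode $have, expected $want"
            status=1
        fi
    done
    exit "$status"
)
```

Running check_serviceauth_perms with no arguments checks /var/cpanel/serviceauth and prints one line per directory that needs attention; no output and a zero exit status mean the modes match.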
Logging into Services as chkservd
The login information used for the chkservd service authentication test can be found here:
/var/cpanel/serviceauth/$SERVICE
[cptech@server ~]# ls -l /var/cpanel/serviceauth
drwx------ 2 root root 4096 May 14 2008 cpdavd/
drwx------ 2 root root 4096 May 14 2008 cpsrvd/
drwxr-x--- 2 root mail 4096 Jan 16 18:47 exim/
drwx------ 2 root root 4096 Jan 16 18:47 ftpd/
drwx------ 2 root root 4096 Jan 16 18:47 imap/
[cptech@server ~]# ls -l /var/cpanel/serviceauth/imap
-rw-r--r-- 1 root root 64 Jan 16 22:27 recv
-rw-r--r-- 1 root root 64 Jan 16 22:27 send
To construct the username, append the contents of the "send" file to __cpanel__service__auth__$SERVICE__.
So for the IMAP example used above, assuming the contents are as follows:
[cptech@server ~]# cat /var/cpanel/serviceauth/imap/send
bUOXmbwKMNfm7MJVhf1ZXr8Z2pmXnu9IOlYZJnszqHE5b3XUL3scJhphjkTwks77
[cptech@server ~]# cat /var/cpanel/serviceauth/imap/recv
3tqM64Si_iAvvgPKiupipzW0Jc7tDe8rXVgtOse6dE8EZorFPNm3xDw129b24NZW
The username to log into IMAP would be:
__cpanel__service__auth__imap__bUOXmbwKMNfm7MJVhf1ZXr8Z2pmXnu9IOlYZJnszqHE5b3XUL3scJhphjkTwks77
Please note the double underscores. With these values, the password would be this:
3tqM64Si_iAvvgPKiupipzW0Jc7tDe8rXVgtOse6dE8EZorFPNm3xDw129b24NZW
Note:
This username and password will change every time chkservd restarts the service (about every 10 minutes if the service is being reported as down). If you need to troubleshoot this, consider temporarily disabling tailwatchd.
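Because the credentials rotate, it can be convenient to rebuild them on demand. The following is a minimal sketch of the construction described above; chkservd_login is a hypothetical name, and the base directory is an argument only so the function can be tried on sample data.

```shell
#!/bin/sh
# Print the chkservd test login for a service, built from its "send" and
# "recv" files. chkservd_login is a hypothetical helper, not a cPanel tool.
chkservd_login() (
    service=$1
    base=${2:-/var/cpanel/serviceauth}
    send=$(cat "$base/$service/send") || exit 1
    recv=$(cat "$base/$service/recv") || exit 1
    # Username: fixed prefix, service name, double underscore, "send" contents.
    printf 'username: __cpanel__service__auth__%s__%s\n' "$service" "$send"
    # Password: the contents of the "recv" file.
    printf 'password: %s\n' "$recv"
)
```

Running chkservd_login imap as root prints the current username and password for the IMAP check.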
Example
Here is an actual example of logging into IMAP:
cd /var/cpanel/serviceauth/imap
more send
more recv
[cptech@server ~]# telnet localhost 143
Services Up and Reported as Down
Unknown HZ Value
A common cause of services reported as down, when they are actually up, is the following error:
Unknown HZ value!
Typically, this is due to a faulty procps. For example:
[cptech@server ~]# /bin/ps -V
Unknown HZ value! (87) Assume 100.
procps version 2.0.6
To fix this, update procps.
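A quick way to test for this condition is to search the ps output for the warning. This is only a sketch: ps_reports_unknown_hz is a hypothetical helper that reads the output on standard input, and it assumes the warning text matches the example above.

```shell
#!/bin/sh
# Succeed when the text on stdin contains the HZ warning shown above.
# ps_reports_unknown_hz is a hypothetical helper name.
ps_reports_unknown_hz() {
    grep -q 'Unknown HZ value'
}
```

It can be used as /bin/ps -V 2>&1 | ps_reports_unknown_hz && echo "procps reports an unknown HZ value", where 2>&1 makes sure the warning is captured even if ps prints it to standard error.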
SSHD failing
Potential cause:
A specific ListenAddress is defined in /etc/ssh/sshd_config. For example:
ListenAddress 201.100.10.1
This will prevent local connections to SSH:
[cptech@server ~]# ssh -p 22 127.0.0.1
ssh: connect to host 127.0.0.1 port 22: Connection refused
Solution:
Disable the ListenAddress line or add an additional ListenAddress line for 127.0.0.1. Then, restart the SSHD service.
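To spot this condition without reading the whole file, a sketch like the following lists the active ListenAddress values and checks whether any of them covers the loopback address. sshd_listens_on_loopback is a hypothetical helper; it only recognizes plain IPv4 values (optionally with a port) and ignores Match blocks and included files.

```shell
#!/bin/sh
# Succeed when the given sshd_config leaves SSH reachable on 127.0.0.1.
# sshd_listens_on_loopback is a hypothetical helper for illustration only.
sshd_listens_on_loopback() (
    config=${1:-/etc/ssh/sshd_config}
    # Collect the values of active (uncommented) ListenAddress lines.
    addrs=$(awk 'tolower($1) == "listenaddress" { print $2 }' "$config") || exit 2
    # With no ListenAddress at all, sshd listens on every local address.
    if [ -z "$addrs" ]; then
        exit 0
    fi
    for addr in $addrs; do
        case $addr in
            127.0.0.1|127.0.0.1:*|0.0.0.0|0.0.0.0:*|localhost|localhost:*)
                exit 0 ;;
        esac
    done
    echo "ListenAddress is set, but no entry covers 127.0.0.1:"
    echo "$addrs"
    exit 1
)
```

A non-zero exit status together with the printed addresses suggests that the chkservd connection to localhost will be refused.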
Exim failing
chkservd thinks Exim is down, though a direct telnet to its port on localhost shows that it's up and listening.
Potential cause:
A custom ACL in /etc/exim.conf, most likely a banner delay:
[cptech@server ~]# grep delay /etc/exim.conf
accept delay = 15s
It appears chkservd wants to see the 220 banner almost immediately after connecting to the service. The custom delay above holds the 220 banner back for about 15 seconds, by which time chkservd has decided the service is down, restarted it, and sent the warning email.
Solution:
Comment out the delay line in /etc/exim.conf (or use the configuration editor in WHM), restart Exim, and explain to the customer why that particular ACL causes failures and restarts in chkservd. Note that lowering the delay to 1 second might avoid the failure warnings, but even at 2 seconds it was failing. You can watch for successes and failures by tailing the chkservd log:
tail -f /var/log/chkservd.log
(Hit enter a few times to clear white space, then either wait for the next run or run /scripts/restartsrv_chkservd in screen or a different terminal to force a check.)
Alternative solution: Check /etc/resolv.conf on the server, and ensure the nameservers listed there respond quickly enough for Exim.
Exim - Timeout while trying to get data from service: Died at /usr/local/cpanel/Cpanel/TailWatch/ChkServd.pm line 812, <$socket_scc> line 11.
Check the Exim logs:
[cptech@server ~]# tail -f /var/log/exim_mainlog
2014-04-07 19:14:37 1WXKcV-0001V9-UF => mauricioandradehn@gmail.com R=lookuphost T=remote_smtp H=gmail-smtp-in.l.google.com [173.194.76.26] X=UNKNOWN:ECDHE-RSA-AES128-GCM-SHA256:128 C="250 2.0.0 OK 1396919677 a3si188460qat.97 - gsmtp"
2014-04-07 19:14:37 1WXKcV-0001V9-UF Completed
2014-04-07 19:16:18 SMTP connection from [127.0.0.1]:35804 (TCP/IP connection count = 1)
2014-04-07 19:16:18 SMTP connection from localhost [127.0.0.1]:35804 lost
2014-04-07 19:17:22 SMTP connection from [::1]:49454 (TCP/IP connection count = 1)
2014-04-07 19:19:06 Berkeley DB error: __fop_file_setup: Retry limit (100) exceeded
2014-04-07 19:19:06 failed to open DB file /var/spool/exim/db/ratelimit: File exists
2014-04-07 19:19:06 H=localhost [::1]:49454 Warning: ACL "warn" statement skipped: condition test deferred: ratelimit database not available
2014-04-07 19:19:06 SMTP connection from localhost [::1]:49454 closed by QUIT
2014-04-07 19:19:08 SMTP connection from [127.0.0.1]:35839 (TCP/IP connection count = 1)
^C
If you see this:
2014-04-07 19:19:06 Berkeley DB error: __fop_file_setup: Retry limit (100) exceeded
2014-04-07 19:19:06 failed to open DB file /var/spool/exim/db/ratelimit: File exists
2014-04-07 19:19:06 H=localhost [::1]:49454 Warning: ACL "warn" statement skipped: condition test deferred: ratelimit database not available
Check whether the file exists and whether it has been renamed:
find /var/spool/exim/db/ -type f -iname '*ratelimit*'
Most likely it will show up as __db.ratelimit. Rename it, restart Exim, restart chkservd, and confirm that it's working:
mv -v /var/spool/exim/db/__db.ratelimit /var/spool/exim/db/ratelimit
/scripts/restartsrv_exim
/scripts/restartsrv_tailwatchd & tail -f /var/log/chkservd.log
MySQL failing but it's actually up and the configured root MySQL password is correct
See if CloudLinux's db_governor is installed by running:
rpm -qa | grep governor
When db_governor is installed, the MySQL pid file is stored at /var/run/mysqld/mysqld.pid instead of /var/lib/mysql/$HOSTNAME.pid, so chkservd is checking a non-existent pid file to see if MySQL is up. A symlink needs to be created to fix this:
ln -s /var/run/mysqld/mysqld.pid /var/lib/mysql/$HOSTNAME.pid
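Before creating the symlink, it may help to confirm which of the two pid files actually exists. The sketch below only prints the command it would suggest and changes nothing; suggest_mysql_pid_symlink is a hypothetical name, and using the output of hostname for the default file name is an assumption that mirrors $HOSTNAME above.

```shell
#!/bin/sh
# Print the symlink command from above when it looks like it is needed.
# suggest_mysql_pid_symlink is a hypothetical helper; both paths are
# arguments so the check can be tried against sample files.
suggest_mysql_pid_symlink() (
    governor_pid=${1:-/var/run/mysqld/mysqld.pid}
    expected_pid=${2:-/var/lib/mysql/$(hostname).pid}
    if [ -e "$expected_pid" ]; then
        echo "$expected_pid already exists; nothing to do"
    elif [ -e "$governor_pid" ]; then
        # Dry run: show the command instead of running it.
        echo "ln -s $governor_pid $expected_pid"
    else
        echo "neither pid file exists; is MySQL running?"
        exit 1
    fi
)
```

Called with no arguments, it checks the two default paths and prints either the ln -s command to run or a short explanation.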
Services listed in Service Manager are not the ones showing in Service Status
Check for unusual entries in /usr/local/cpanel/Cpanel/TailWatch/
Timeout while trying to get data from service
Check the clock. One way to check would be to run top and then break out and immediately run date. There should be at most a 1-second difference. If there's a drift, it could cause issues with chkservd's tests.
This error can also appear when Exim on port 465 is enabled in both Service Manager and the Exim configuration file. To address this issue, uncheck the box for "Exim on another port" in Service Manager, then verify that the server is still listening on port 465.
Clock issues
As mentioned above, a messed-up clock will cause issues with chkservd.
Networking
For at least cpsrvd, the source IP needs to be localhost. Odd network/route/iptables rules can interfere with this, so that connections to 127.0.0.1 are actually routed via another IP or interface. This prevents authentication from working and should be corrected. In one case, this was corrected by reviewing the masqueraded loopback address.
chkservd process has become non-responsive (hung)
You may find a customer with a CentOS 7 server who received an email that chkservd has hung for an extended amount of time. If you check their list of running processes, you may see something like this:
[cptech@server ~]# ps faux | grep dove
root 20291 0.1 0.0 52700 14668 ? S 21:25 0:00 \_ /usr/local/cpanel/scripts/restartsrv_dovecot
root 20309 85.8 0.1 681416 31808 ? R 21:25 4:44 \_ /usr/bin/systemctl status dovecot.service <-------------
root 25800 0.0 0.0 112592 984 pts/0 S+ 21:30 0:00 \_ grep --color=auto dove
root 1242 0.0 0.0 18352 1588 ? Ss 21:06 0:00 /usr/sbin/dovecot -F -c /etc/dovecot/dovecot.conf
dovenull 1246 0.0 0.0 45392 3496 ? S 21:06 0:00 \_ dovecot/pop3-login
dovenull 1248 0.0 0.0 45520 3860 ? S 21:06 0:00 \_ dovecot/imap-login
dovecot 1249 0.0 0.0 9432 1160 ? S 21:06 0:00 \_ dovecot/anvil
root 1250 0.0 0.0 9564 1360 ? S 21:06 0:00 \_ dovecot/log
dovenull 1252 0.0 0.0 45392 3504 ? S 21:06 0:00 \_ dovecot/pop3-login
root 1253 0.0 0.0 11784 3192 ? S 21:06 0:00 \_ dovecot/config
dovenull 1254 0.0 0.0 45516 3860 ? S 21:06 0:00 \_ dovecot/imap-login
dovecot 1255 0.0 0.0 27892 2280 ? S 21:06 0:00 \_ dovecot/auth
The issue here is that the systemctl status dovecot.service command is using a large amount of CPU. If you strace the process, you may also see something like this:
# strace -vvtfp 20309 -s 3000
Process 20309 attached
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x6ebb000) = 0x7fe89d436000
21:31:21 munmap(0x7fe8b58f8000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x76e8000) = 0x7fe8b58f8000
21:31:21 munmap(0x7fe89d436000, 8388608) = 0
21:31:21 write(1, "Jan 14 10:37:20 coyote.bizzahost.com dovecot[13046]: pop3-login: Disconnected (no auth attempts in 1 secs): user=<>, rip=41.190.12.116, lip=50.30.46.201, TLS handshaking: Disconnected, session=<OncW30gp/3kpvgx0>\n", 212) = 212
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x4bb7000) = 0x7fe89d436000
21:31:21 munmap(0x7fe8a7436000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x2b7a000) = 0x7fe8a7436000
21:31:21 munmap(0x7fe8b58f8000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x4ff4000) = 0x7fe8b58f8000
21:31:21 munmap(0x7fe8c34f5000, 4669440) = 0
21:31:21 mmap(NULL, 4669440, PROT_READ, MAP_SHARED, 7, 0x7b8c000) = 0x7fe8c34f5000
21:31:21 munmap(0x7fe8a7436000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x2b7a000) = 0x7fe8a7436000
21:31:21 munmap(0x7fe8a6c36000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x72e8000) = 0x7fe8a6c36000
21:31:21 munmap(0x7fe8c34f5000, 4669440) = 0
21:31:21 mmap(NULL, 4669440, PROT_READ, MAP_SHARED, 7, 0x7b8c000) = 0x7fe8c34f5000
21:31:21 munmap(0x7fe89d436000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x53f9000) = 0x7fe89d436000
21:31:21 munmap(0x7fe8a7436000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x2b7a000) = 0x7fe8a7436000
21:31:21 munmap(0x7fe8a6c36000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x72e8000) = 0x7fe8a6c36000
21:31:21 munmap(0x7fe8c34f5000, 4669440) = 0
21:31:21 mmap(NULL, 4669440, PROT_READ, MAP_SHARED, 7, 0x7b8c000) = 0x7fe8c34f5000
21:31:21 munmap(0x7fe8b58f8000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x57fa000) = 0x7fe8b58f8000
21:31:21 munmap(0x7fe89d436000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x5c76000) = 0x7fe89d436000
21:31:21 munmap(0x7fe8b58f8000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x614a000) = 0x7fe8b58f8000
21:31:21 munmap(0x7fe89d436000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x6ebb000) = 0x7fe89d436000
21:31:21 munmap(0x7fe8b58f8000, 8388608) = 0
21:31:21 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 7, 0x76e8000) = 0x7fe8b58f8000
21:31:21 munmap(0x7fe89d436000, 8388608) = 0
21:31:21 write(1, "Jan 14 10:38:02 coyote.bizzahost.com dovecot[13046]: pop3-login: Aborted login (no auth attempts in 0 secs): user=<>, rip=127.0.0.1, lip=127.0.0.1, secured, session=<IzKU4UgpX5B/AAAB>\n", 184) = 184
This may indicate that the journals on the server are corrupted. To confirm this, manually run journalctl --verify:
# journalctl --verify
If there are any bad journals found, stop journald, rename the old journal directory, then restart journald again:
[cptech@server ~]# systemctl stop systemd-journald.service
Warning: Stopping systemd-journald.service, but it can still be activated by:
systemd-journald.socket
Warning: Stopping systemd-journald.service, but it can still be activated by:
systemd-journald.socket
[cptech@server ~]# mv -v /var/log/journal/c1b714f8c308426e9436fd8bc5a49206{,-cptechs}
‘/var/log/journal/c1b714f8c308426e9436fd8bc5a49206’ -> ‘/var/log/journal/c1b714f8c308426e9436fd8bc5a49206-cptechs’
Check the logs for restarts:
awk '{if (/Restarting/) print $1, $2, $NF}' /var/log/chkservd.log | sort | uniq
chkservd not checking a service as it's "too soon" since last restart
You can make chkservd check on the current status of a service by moving the service file aside from /var/run/chkservd.services_suspend and then restarting the tailwatchd service. For example, if Exim is not being checked by chkservd:
mv -v /var/run/chkservd.services_suspend/exim /var/run/chkservd.services_suspend/exim.BAK
/scripts/restartsrv_tailwatchd
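To see which services are currently being skipped, the suspend directory can simply be listed. The sketch below assumes, based on the exim example above, that each suspended service has a file named after it in that directory; list_suspended_services is a hypothetical helper name.

```shell
#!/bin/sh
# List services that have a suspend file, skipping files already moved
# aside with a .BAK suffix. list_suspended_services is a hypothetical helper.
list_suspended_services() (
    dir=${1:-/var/run/chkservd.services_suspend}
    # Nothing to report when the directory does not exist.
    [ -d "$dir" ] || exit 0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case $f in
            *.BAK) continue ;;
        esac
        printf '%s\n' "${f##*/}"
    done
)
```

Each name printed is a candidate for the mv command shown above.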
pure-ftpd Error: Home Directory Not Available - aborting
If the /var/cpanel/userhomes/cpanel directory does not exist, or has invalid permissions, chkservd will report FTPD as down when it is actually up, as seen here:
ftpd [[socket_service_auth:1]TCP Transaction Log:
You can resolve this by creating the directory, and ensuring it has the proper ownership, with the following command:
mkdir -p /var/cpanel/userhomes/cpanel; chown cpanel: /var/cpanel/userhomes/cpanel
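To confirm whether the directory is the problem before changing anything, a small check like the following can be used. check_userhome is a hypothetical helper; it only looks at existence and owner, and it assumes GNU stat (the -c option).

```shell
#!/bin/sh
# Verify that a directory exists and is owned by the expected user.
# check_userhome is a hypothetical helper; defaults follow the text above.
check_userhome() (
    dir=${1:-/var/cpanel/userhomes/cpanel}
    want_owner=${2:-cpanel}
    if [ ! -d "$dir" ]; then
        echo "$dir: missing"
        exit 1
    fi
    owner=$(stat -c '%U' "$dir")
    if [ "$owner" != "$want_owner" ]; then
        echo "$dir: owned by $owner, expected $want_owner"
        exit 1
    fi
    echo "$dir: ok"
)
```

If it reports a missing directory or a different owner, the mkdir and chown command above applies.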
cpsrvd giving 401 error
If chkservd is reporting a 401 error from cpsrvd, in many cases this is caused by CSF/iptables routing rules. Here is an example error message that you may see in /var/log/chkservd.log:
cpsrvd [[http_service_auth:1]cpsrvd: [HTTP/1.0 401 Access Denied != HTTP/1.x 200 OK]
You can verify the routing with the following command:
curl -k --silent http://127.0.0.1:2086 > /dev/null && tail /usr/local/cpanel/logs/access_log -n1
chkservd logs - too soon after restart to check
To clear this so that the service check occurs on the next chkservd run, move the service's file aside in /var/run/chkservd.services_suspend/ and restart tailwatchd:
mv /var/run/chkservd.services_suspend/exim /var/run/chkservd.services_suspend/exim.BAK
/scripts/restartsrv_tailwatchd