Up to 18 hour delay for checkallsslcerts triggered web app errors during PEAK HOURS!
Incident: Multiple services were restarted in the middle of the day, which triggered a series of web app errors due to the disruptions in Apache, MariaDB, Exim, Dovecot, etc. It was quite disconcerting at the time, as my server appeared to be under attack.
Cause: /usr/local/cpanel/bin/checkallsslcerts updated the SSL certificate for cPanel services, then restarted the various services to use the new cert.
Chain of events leading to this degradation during peak hours:
/usr/local/cpanel/scripts/upcp runs via cron at 2:20AM
Among many other things, upcp runs:
/usr/local/cpanel/scripts/maintenance
The maintenance script schedules checkallsslcerts to run at a random time, which I assume is meant to spread load on the certificate servers. But a delay of up to 18 hours is unacceptable for any business that aspires to be professional and doesn't want to risk losing customers or prospects who happen to be using the services when they become unavailable.
Yes, I know the SSL certificate is updated only once every few months, but this disruption during peak hours is avoidable. Shorter delays, combined with the fact that servers are in different timezones, should be enough to distribute the load.
Furthermore, as stated in cPanel Docs > The checkallsslcerts Script:
The system runs the /usr/local/cpanel/bin/checkallsslcerts script in the following situations:
- During the nightly cPanel & WHM update (upcp) process.
- When you purchase or add a cPanel & WHM license.
So the documentation is also incorrect or at least incomplete, as nowhere does it mention that it will be scheduled to run at a random time up to 18 hours later.
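To put numbers on the window described above: with upcp starting at 2:20AM and a delay of up to 18 hours, the restart can land as late as 8:20PM server time. A quick shell sketch of that arithmetic follows. The function name latest_run is just illustrative, and the start time and maximum delay are the ones quoted in this post.

```shell
#!/bin/sh
# Latest possible run time for a task scheduled with a bounded delay.
# Both arguments are in minutes; the start is minutes since midnight.
latest_run() {
    start_min=$1      # when upcp starts (minutes since midnight)
    max_delay_min=$2  # maximum scheduling delay (minutes)
    # Wrap around midnight so the result is always a valid clock time.
    total=$(( (start_min + max_delay_min) % 1440 ))
    printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

# upcp at 2:20AM, delay of up to 18 hours
latest_run $(( 2 * 60 + 20 )) $(( 18 * 60 ))
```

So on a server whose customers are active during the day, most of the possible run times fall inside business hours.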
-
Hey there! I hadn't heard about this one before, but you are absolutely correct - the maintenance tool picks a random time within the next 18 hours, which we can see inside the /usr/local/cpanel/scripts/maintenance file:
sub action_checkallsslcerts {
    # Base install does this in the background before upcp
    return if $ENV{'CPANEL_BASE_INSTALL'};
    my $max_delay_seconds = 18 * 60 * 60;    # 18 hours
    my $bytes_to_get = length($max_delay_seconds) + 1;
    my $rand_int = Cpanel::Rand::Get::getranddata( $bytes_to_get, [ 0 .. 9 ] );
    my $delay_seconds = $rand_int % $max_delay_seconds;
    # Should be between 1 and $max_delay_seconds
    # scheduling a task for 0 seconds will cause queueprocd to throw an error
    $delay_seconds++;
    return (
        show_status('Scheduling task to check service default SSL/TLS certificates'),
        sub {
            Cpanel::ServerTasks::schedule_task(
                ['SSLTasks'],
                $delay_seconds,
                'checkallsslcerts'
            );
        },
    );
}
You are also correct about the "why" - we originally implemented this to keep cPanel requests from overloading the AutoSSL servers at Sectigo.
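The delay logic in the excerpt above boils down to "random integer modulo 64800, plus one", which gives a value between 1 and 64800 seconds. Here is a rough shell equivalent for anyone who wants to see the range for themselves. Note the assumption: it reads random bytes from /dev/urandom with od as a stand-in for Cpanel::Rand::Get::getranddata, so it is not how cPanel generates the number internally, and random_delay is a made-up name.

```shell
#!/bin/sh
# Same bound as in the excerpt: 18 * 60 * 60 = 64800 seconds.
MAX_DELAY_SECONDS=$(( 18 * 60 * 60 ))

random_delay() {
    # Read 4 random bytes as one unsigned integer (stand-in for cPanel's helper).
    n=$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')
    # n % max is 0..max-1; adding 1 shifts it to 1..max, so it is never 0.
    echo $(( n % MAX_DELAY_SECONDS + 1 ))
}

random_delay
```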
One possible workaround would be to flush the task queue on the server immediately after upcp runs. This would force the scheduled random maintenance to happen immediately. You could do that by adding the following line to /scripts/postupcp:
/usr/local/cpanel/bin/servers_queue run
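If /scripts/postupcp does not exist yet, a minimal version of the hook might look like the sketch below. The command and the hook path are the ones given in this reply. The SERVERS_QUEUE variable, the flush_task_queue function and the existence check are additions for illustration, so treat them as assumptions rather than anything cPanel ships.

```shell
#!/bin/sh
# Sketch of a postupcp hook that flushes the task queue after upcp finishes.
# SERVERS_QUEUE is a hypothetical override; the default is the command from the reply.
SERVERS_QUEUE=${SERVERS_QUEUE:-/usr/local/cpanel/bin/servers_queue}

flush_task_queue() {
    if [ -x "$SERVERS_QUEUE" ]; then
        # Run queued tasks now instead of waiting for their scheduled time.
        "$SERVERS_QUEUE" run
    else
        # Do not fail the hook if the binary is missing; just report it.
        echo "servers_queue not found at $SERVERS_QUEUE; skipping" >&2
    fi
}

flush_task_queue
```

The hook file would also need to be executable (for example via chmod +x) for upcp to be able to run it.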
Let me know if that helps!
0 -
My personal solution was to change $max_delay_seconds to 1800 (half an hour). That is plenty to address the load issue, especially since cert updates happen only a few times a year, and I need the check to finish before my scheduled 3AM WHM backup.
But my main reason for this post was to identify an issue that affects all cPanel users. Probably the vast majority aren't even aware of it. But it is a bug which needs to be resolved for all users instead of individual sysadmins editing cPanel scripts on their servers, especially when those scripts might be overwritten in future upgrades.
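Since an upgrade can silently put the original script back, anyone carrying a local edit like mine may want a small check that the override is still there. A sketch, assuming the edited line reads something like "my $max_delay_seconds = 1800;" (the check_max_delay function and the MAINTENANCE_SCRIPT variable are names I made up for this example):

```shell
#!/bin/sh
# Path from this thread; overridable so the check can be pointed at a copy.
MAINTENANCE_SCRIPT=${MAINTENANCE_SCRIPT:-/usr/local/cpanel/scripts/maintenance}

# Succeeds only if the file assigns the expected literal to max_delay_seconds.
check_max_delay() {
    file=$1
    expected=$2
    grep -Eqs "max_delay_seconds[[:space:]]*=[[:space:]]*${expected}[[:space:]]*;" "$file"
}

if check_max_delay "$MAINTENANCE_SCRIPT" 1800; then
    echo "override still in place"
else
    echo "override missing: $MAINTENANCE_SCRIPT may have been replaced by an update"
fi
```

This only detects the problem after the fact, which is part of why a fix on cPanel's side would be better.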
In other words, it should either function as stated in the docs (2AM) or at least limit the max delay and mention that in the docs.
Can do?
0 -
Sure - I've made case CPANEL-44141 to bring this up to our SSL team and I've linked them this thread as well. If I hear any updates on it I'll be sure to post!
0 -
Thank you!
0 -
Sure thing!