[CPANEL-32760] Amazon S3 backup transport failure (Amazon responded with 403 Forbidden)
I am facing an intermittent issue with backup transport to Amazon S3. The majority of accounts copy across just fine, however one or two accounts fail with the error message:
After each failure the system retries the backup process for these accounts and fails each time. Every subsequent run of the backup process fails on the same accounts (though these are not the largest accounts). I can't work out specifically why these accounts are failing and others are succeeding. I was able to trace this back to the MultiPart upload function `_do_multipart_upload` in `/usr/local/cpanel/Cpanel/Transport/Files/AmazonS3.pm`. It appears that one part of the MultiPart upload is failing for some reason, which is crashing the entire upload. Below is an excerpt from one of the backup transport logs. This shows a successful transfer, followed by a failed transfer. This particular one recovered on the third try, however this is not always the case.
Note the timestamps. The backup appears to fail part-way through the transfer process, with the first attempt failing at 32 minutes, the second at 31 minutes and the final successful transfer taking 1:59. I dug through the forums and feature requests and found this request: Again, this won't resolve the problem, but would make it much faster. Has anybody experienced a similar issue? Why are we seeing a 403 forbidden error on some of these requests?
Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden
at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136.
After each failure the system retries the backup process for these accounts and fails each time. Every subsequent run of the backup process fails on the same accounts (though these are not the largest accounts). I can't work out specifically why these accounts are failing and others are succeeding. I was able to trace this back to the MultiPart upload function `_do_multipart_upload` in `/usr/local/cpanel/Cpanel/Transport/Files/AmazonS3.pm`. It appears that one part of the MultiPart upload is failing for some reason, which is crashing the entire upload. Below is an excerpt from one of the backup transport logs. This shows a successful transfer, followed by a failed transfer. This particular one recovered on the third try, however this is not always the case.
[2020-07-07 01:46:19 +1000] info [cpbackup_transporter] Upload attempt #1 starting for /backup/weekly/2020-07-07/accounts/agedcareservices.tar.gz to /weekly/2020-07-07/accounts/agedcareservices.tar.gz for destination: Articuno Backup
[2020-07-07 01:46:31 +1000] info [cpbackup_transporter] Successful transfer of /backup/weekly/2020-07-07/accounts/agedcareservices.tar.gz to /weekly/2020-07-07/accounts/agedcareservices.tar.gz for destination Articuno Backup
[2020-07-07 01:46:31 +1000] info [cpbackup_transporter] The backup has been successfully uploaded at least once, now we will delete the local copy (/backup/weekly/2020-07-07/accounts/agedcareservices.tar.gz) since keeplocal (0) is disabled.
[2020-07-07 01:46:40 +1000] info [cpbackup_transporter] cpbackup_transporter - Processing next task
[2020-07-07 01:46:40 +1000] info [cpbackup_transporter] Instantiating Object
[2020-07-07 01:46:41 +1000] info [cpbackup_transporter] Starting a "copy" operation on the "Articuno Backup" destination ID "FM*****redacted*****BP8".
[2020-07-07 01:46:41 +1000] info [cpbackup_transporter] Validating destination path /weekly/2020-07-07/accounts/
[2020-07-07 01:46:42 +1000] info [cpbackup_transporter] Path exists
[2020-07-07 01:46:42 +1000] info [cpbackup_transporter] Uploading account backup /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz (from weekly/2020-07-07/accounts/johnson1.tar.gz)
[2020-07-07 01:46:42 +1000] info [cpbackup_transporter] Attempting to upload /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination: Articuno Backup
[2020-07-07 01:46:42 +1000] info [cpbackup_transporter] Upload attempt #1 starting for /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination: Articuno Backup
[2020-07-07 02:18:49 +1000] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden
at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136.
at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27.
Cpanel::LoggerAdapter::warn(Cpanel::LoggerAdapter=HASH(0x17183e0), "Upload attempt failed: Amazon::S3: Amazon responded with 403 "...) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 658
Cpanel::Backup::Queue::transport_backup::attempt_to_upload_file(Cpanel::Backup::Queue::transport_backup=HASH(0x151a120), Cpanel::Transport::Files::AmazonS3=HASH(0x37bab98), "/backup/weekly/2020-07-07/accounts/johnson1.tar.gz", "/weekly/2020-07-07/accounts/johnson1.tar.gz", Cpanel::LoggerAdapter=HASH(0x17183e0)) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 333
Cpanel::Backup::Queue::transport_backup::process_task(Cpanel::Backup::Queue::transport_backup=HASH(0x151a120), Cpanel::TaskQueue::Task=HASH(0x3fc3078), Cpanel::LoggerAdapter=HASH(0x17183e0)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 635
eval {...} called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 638
Cpanel::TaskQueue::__ANON__() called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
eval {...} called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
Cpanel::StateFile::Guard::call_unlocked(Cpanel::StateFile::Guard=HASH(0x3fdfb40), CODE(0x303db28)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 643
Cpanel::TaskQueue::process_next_task(Cpanel::TaskQueue=HASH(0x25bf8f8)) called at /usr/local/cpanel/bin/cpbackup_transporter line 168
eval {...} called at /usr/local/cpanel/bin/cpbackup_transporter line 166
[2020-07-07 02:18:51 +1000] info [cpbackup_transporter] Upload attempt #2 starting for /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination: Articuno Backup
[2020-07-07 02:49:54 +1000] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden
at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136.
at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27.
Cpanel::LoggerAdapter::warn(Cpanel::LoggerAdapter=HASH(0x17183e0), "Upload attempt failed: Amazon::S3: Amazon responded with 403 "...) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 658
Cpanel::Backup::Queue::transport_backup::attempt_to_upload_file(Cpanel::Backup::Queue::transport_backup=HASH(0x151a120), Cpanel::Transport::Files::AmazonS3=HASH(0x37bab98), "/backup/weekly/2020-07-07/accounts/johnson1.tar.gz", "/weekly/2020-07-07/accounts/johnson1.tar.gz", Cpanel::LoggerAdapter=HASH(0x17183e0)) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 333
Cpanel::Backup::Queue::transport_backup::process_task(Cpanel::Backup::Queue::transport_backup=HASH(0x151a120), Cpanel::TaskQueue::Task=HASH(0x3fc3078), Cpanel::LoggerAdapter=HASH(0x17183e0)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 635
eval {...} called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 638
Cpanel::TaskQueue::__ANON__() called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
eval {...} called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
Cpanel::StateFile::Guard::call_unlocked(Cpanel::StateFile::Guard=HASH(0x3fdfb40), CODE(0x303db28)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 643
Cpanel::TaskQueue::process_next_task(Cpanel::TaskQueue=HASH(0x25bf8f8)) called at /usr/local/cpanel/bin/cpbackup_transporter line 168
eval {...} called at /usr/local/cpanel/bin/cpbackup_transporter line 166
[2020-07-07 02:49:55 +1000] info [cpbackup_transporter] Upload attempt #3 starting for /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination: Articuno Backup
[2020-07-07 04:50:09 +1000] info [cpbackup_transporter] Successful transfer of /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination Articuno Backup
[2020-07-07 04:50:09 +1000] info [cpbackup_transporter] The backup has been successfully uploaded at least once, now we will delete the local copy (/backup/weekly/2020-07-07/accounts/johnson1.tar.gz) since keeplocal (0) is disabled.
[2020-07-07 04:50:09 +1000] info [cpbackup_transporter] cpbackup_transporter - Processing next task
Note the timestamps. The backup appears to fail part-way through the transfer process, with the first attempt failing at 32 minutes, the second at 31 minutes and the final successful transfer taking 1:59. I dug through the forums and feature requests and found this request: Again, this won't resolve the problem, but would make it much faster. Has anybody experienced a similar issue? Why are we seeing a 403 forbidden error on some of these requests?
-
I have also been encountering similar intermittent transport failures recently. This only started occurring since a few weeks ago however I have been getting failures almost daily, however the affected accounts are different each day. Prior to this backups had been working without issue, and I haven't made any configuration changes recently. The log usually contains "Upload attempt failed: Amazon::S3: Amazon responded with 403" the same as in the previous comment, but I have also noticed this failure message on certain attempts: "Cpanel::Transport::Files::put" attempted to validate a file that does not exist: /backup/2020-07-16/accounts/AFFECTEDACCOUNT.tar.gz at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27. An account could fail on due to either or both reasons for subsequent attempts, and I haven't so far been able to identify a pattern as to why this is occurring. Looking at the backup logs, the backups all appear to be generating with no failures, so i'm not really sure why they would return this error. 0 -
It sounds as though a timeout is being reached though this isn't clear as to which or why. Can you please open a ticket using the link in my signature? Once open please reply with the Ticket ID here so that we can update this thread with the resolution once the ticket is resolved. Thanks! 0 -
I have done some more digging into the Transporter Logs this morning and can confirm that the affected accounts are indeed random. Most accounts, however, recover after the first or second failure (it tries each account three times). I have also noticed that this only affects two of our WHM Virtual Servers. Our other three Virtual Servers do have any 403 failures in the logs, even though they are in the same datacentre, same subnet and use the same AWS access keys to the same S3 account. One of these is even on the same physical server (Host) as one of the affected Virtual Servers. I have opened a support ticket. ID: 93510870 0 -
Hello, I checked in on that ticket and found that there has been some work done it and it is currently awaiting your response. I'm following it now and will do my best to add any updates as they're available. 0 -
I have an update. I was able to determine (by modifying the output of the '_croak_if_response_error' in /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3.pm
to log the server response message rather than just the status-code) that the specific error message was:"RequestTimeTooSkewed": "The difference between the request time and the current time is too large."
. The server time is correct (within 4 seconds of Amazon's servers). The log output is below:[2020-07-21 11:48:17 +1000] info [cpbackup_transporter] Upload attempt #1 starting for /cluster/2020-07-21/accounts/runf$ [2020-07-21 13:48:23 +1000] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden -
RequestTimeTooSkewed
The difference between the request time and the current time is too large.20200721T033148Z2020-07-21T03:48:24Z900000
The request was apparently made at 03:31:48 and Amazon's server received this at: 03:48:24. A 16 minute, 36 second difference. The log entry time confirms that our server time is only off by a few seconds (+10hrs due to timezone). It seems that the upload request is stalling part-way through for some reason.0 -
I just double checked the ticket to make sure it was added there as well. Thanks for this and I am watching closely to see if it spawns a case. I am curious to understand the cause of this delay, it doesn't seem to make a lot of sense. Nice move though modifying the PM you may want to backup the modification in case an update overwrites it as well. 0 -
So the cPanel Support were not able to determine why these requests are being delayed in-transit. They have given up on diagnosing this on the basis that this is likely a network issue rather than software issue - fair, but we are stuck without a functioning backup system. This morning the issue has affected another of our servers (now affecting 3 of our 5 servers). Is there any way to have this fix pushed through as a priority? 0 -
That doesn't seem to be the sentiment from the analyst's notes in the ticket. They left notes indicating that you were going to perform some networking tests rather than have them perform more from our end. Maybe there was some misunderstanding in communication? 0 -
Sorry, I should have mentioned that I did indeed complete the testing as recommended by the support technicians. I was unable to reply to the ticket as it was closed. I did not find anything noteworthy in these tests. I don't believe that there is a major issue with the networking setup on our server. Any network stability issues appear to be external to our servers on the Internet somewhere between our server and Amazon's servers. I believe the major issue here is therefore that the backup transporter does not gracefully handle random request failures - instead restarting the entire transfer process. It is essentially impossible to guarantee delivery of all requests over the Internet. In our case these are sending from our server in a Sydney Datacentre, to Amazon S3 in the US - a quick traceroute shows that this is 29 hops - so there is plenty of opportunity for something to go wrong. We are seeing roughly 1 in 1000 PUT requests failing during the backup transport process (with each request having a payload of ~10MB). While this number is a bit high, it is not unheard of, and requests this large are basically a worst-case scenario. The backup transporter should have been designed to handle these type of errors. Designing this to function only in a best-case scenario where there are no lost requests was reckless. This is why I am pushing for this fix: 0 -
Hi @pwells I'm having that L3 ticket reopened for you and I'm going to include the above response in the ticket notes for the analysts to see for you. 0 -
One of our servers are experiencing the same behavior. The server have 4 accounts, the smallest makes a 9GB backup file. The server is hosted in Brazil and in this case the error is not intermittent. Every upload attempt is failling after 20 to 40 minutes since July [2020-08-12 02:24:21 -0300] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136. at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27. Cpanel::LoggerAdapter::warn(Cpanel::LoggerAdapter=HASH(0x11ef890), "Upload attempt failed: Amazon::S3: Amazon responded with 403 "...) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 658 Cpanel::Backup::Queue::transport_backup::attempt_to_upload_file(Cpanel::Backup::Queue::transport_backup=HASH(0xff1200), Cpanel::Transport::Files::AmazonS3=HASH(0x20f7dd8), "/backup/2020-08-12/accounts/vendasgoverno.tar.gz", "/2020-08-12/accounts/vendasgoverno.ta$ Cpanel::Backup::Queue::transport_backup::process_task(Cpanel::Backup::Queue::transport_backup=HASH(0xff1200), Cpanel::TaskQueue::Task=HASH(0x208ebe0), Cpanel::LoggerAdapter=HASH(0x11ef890)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 635 eval {...} called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 638 Cpanel::TaskQueue::__ANON__() called at /usr/local/cpanel/Cpanel/StateFile.pm line 241 eval {...} called at /usr/local/cpanel/Cpanel/StateFile.pm line 241 Cpanel::StateFile::Guard::call_unlocked(Cpanel::StateFile::Guard=HASH(0x208e478), CODE(0x1c4e220)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 643 Cpanel::TaskQueue::process_next_task(Cpanel::TaskQueue=HASH(0x208e538)) called at /usr/local/cpanel/bin/cpbackup_transporter line 168 eval {...} called at /usr/local/cpanel/bin/cpbackup_transporter line 166
0 -
Hi guys, based on the discussion in the ticket and some notes added to it that you're unable to see, there is a suggestion that this may be related to an internal case CPANEL-32760 - S3 Compatible have a default region of us-east-1, this can fail for some custom regions. I believe this suggestion is correct and I believe it to be related to the issues you're both having most likely. I sent the tech support supervisor a message indicating this and my assumptions and I'm awaiting her response. Until then, I let her know that I'm going to go ahead and mark this thread as related to the internal case. I'll let you know what comes of this when I have more information. In the internal case there is a suggested workaround for this which I don't particularly like but it's as follows: [QUOTE] Known work around(s) Modify the code and disable updates. The /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3.pm file is part of the cpanel-perl-530-Amazon-S3-0.45-1.cp1186.noarch RPM which cannot be excluded without excluding all of cpanel-perl-530. 0 -
Hello @cPanelLauren I just want you to know that on saturday (15th) my server was able to send the backups again and the error is gone. The last failure was on August 14th. Nothing was changed on my side, though. 0
Please sign in to leave a comment.
Comments
13 comments