Skip to main content

[CPANEL-32760] Amazon S3 backup transport failure (Amazon responded with 403 Forbidden)

Comments

13 comments

  • f3cp
    I have also been encountering similar intermittent transport failures recently. This only started occurring since a few weeks ago however I have been getting failures almost daily, however the affected accounts are different each day. Prior to this backups had been working without issue, and I haven't made any configuration changes recently. The log usually contains "Upload attempt failed: Amazon::S3: Amazon responded with 403" the same as in the previous comment, but I have also noticed this failure message on certain attempts: "Cpanel::Transport::Files::put" attempted to validate a file that does not exist: /backup/2020-07-16/accounts/AFFECTEDACCOUNT.tar.gz at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27. An account could fail on due to either or both reasons for subsequent attempts, and I haven't so far been able to identify a pattern as to why this is occurring. Looking at the backup logs, the backups all appear to be generating with no failures, so i'm not really sure why they would return this error.
    0
  • cPanelLauren
    It sounds as though a timeout is being reached though this isn't clear as to which or why. Can you please open a ticket using the link in my signature? Once open please reply with the Ticket ID here so that we can update this thread with the resolution once the ticket is resolved. Thanks!
    0
  • pwells
    I have done some more digging into the Transporter Logs this morning and can confirm that the affected accounts are indeed random. Most accounts, however, recover after the first or second failure (it tries each account three times). I have also noticed that this only affects two of our WHM Virtual Servers. Our other three Virtual Servers do have any 403 failures in the logs, even though they are in the same datacentre, same subnet and use the same AWS access keys to the same S3 account. One of these is even on the same physical server (Host) as one of the affected Virtual Servers. I have opened a support ticket. ID: 93510870
    0
  • cPanelLauren
    Hello, I checked in on that ticket and found that there has been some work done it and it is currently awaiting your response. I'm following it now and will do my best to add any updates as they're available.
    0
  • pwells
    I have an update. I was able to determine (by modifying the output of the '_croak_if_response_error' in /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3.pm
    to log the server response message rather than just the status-code) that the specific error message was: "RequestTimeTooSkewed": "The difference between the request time and the current time is too large."
    . The server time is correct (within 4 seconds of Amazon's servers). The log output is below: [2020-07-21 11:48:17 +1000] info [cpbackup_transporter] Upload attempt #1 starting for /cluster/2020-07-21/accounts/runf$ [2020-07-21 13:48:23 +1000] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden - RequestTimeTooSkewedThe difference between the request time and the current time is too large.20200721T033148Z2020-07-21T03:48:24Z900000
    The request was apparently made at 03:31:48 and Amazon's server received this at: 03:48:24. A 16 minute, 36 second difference. The log entry time confirms that our server time is only off by a few seconds (+10hrs due to timezone). It seems that the upload request is stalling part-way through for some reason.
    0
  • cPanelLauren
    I just double checked the ticket to make sure it was added there as well. Thanks for this and I am watching closely to see if it spawns a case. I am curious to understand the cause of this delay, it doesn't seem to make a lot of sense. Nice move though modifying the PM you may want to backup the modification in case an update overwrites it as well.
    0
  • pwells
    So the cPanel Support were not able to determine why these requests are being delayed in-transit. They have given up on diagnosing this on the basis that this is likely a network issue rather than software issue - fair, but we are stuck without a functioning backup system. This morning the issue has affected another of our servers (now affecting 3 of our 5 servers). Is there any way to have this fix pushed through as a priority?
    0
  • cPanelLauren
    That doesn't seem to be the sentiment from the analyst's notes in the ticket. They left notes indicating that you were going to perform some networking tests rather than have them perform more from our end. Maybe there was some misunderstanding in communication?
    0
  • pwells
    Sorry, I should have mentioned that I did indeed complete the testing as recommended by the support technicians. I was unable to reply to the ticket as it was closed. I did not find anything noteworthy in these tests. I don't believe that there is a major issue with the networking setup on our server. Any network stability issues appear to be external to our servers on the Internet somewhere between our server and Amazon's servers. I believe the major issue here is therefore that the backup transporter does not gracefully handle random request failures - instead restarting the entire transfer process. It is essentially impossible to guarantee delivery of all requests over the Internet. In our case these are sending from our server in a Sydney Datacentre, to Amazon S3 in the US - a quick traceroute shows that this is 29 hops - so there is plenty of opportunity for something to go wrong. We are seeing roughly 1 in 1000 PUT requests failing during the backup transport process (with each request having a payload of ~10MB). While this number is a bit high, it is not unheard of, and requests this large are basically a worst-case scenario. The backup transporter should have been designed to handle these type of errors. Designing this to function only in a best-case scenario where there are no lost requests was reckless. This is why I am pushing for this fix:
    0
  • cPanelLauren
    Hi @pwells I'm having that L3 ticket reopened for you and I'm going to include the above response in the ticket notes for the analysts to see for you.
    0
  • plague
    One of our servers are experiencing the same behavior. The server have 4 accounts, the smallest makes a 9GB backup file. The server is hosted in Brazil and in this case the error is not intermittent. Every upload attempt is failling after 20 to 40 minutes since July [2020-08-12 02:24:21 -0300] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136. at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27. Cpanel::LoggerAdapter::warn(Cpanel::LoggerAdapter=HASH(0x11ef890), "Upload attempt failed: Amazon::S3: Amazon responded with 403 "...) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 658 Cpanel::Backup::Queue::transport_backup::attempt_to_upload_file(Cpanel::Backup::Queue::transport_backup=HASH(0xff1200), Cpanel::Transport::Files::AmazonS3=HASH(0x20f7dd8), "/backup/2020-08-12/accounts/vendasgoverno.tar.gz", "/2020-08-12/accounts/vendasgoverno.ta$ Cpanel::Backup::Queue::transport_backup::process_task(Cpanel::Backup::Queue::transport_backup=HASH(0xff1200), Cpanel::TaskQueue::Task=HASH(0x208ebe0), Cpanel::LoggerAdapter=HASH(0x11ef890)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 635 eval {...} called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 638 Cpanel::TaskQueue::__ANON__() called at /usr/local/cpanel/Cpanel/StateFile.pm line 241 eval {...} called at /usr/local/cpanel/Cpanel/StateFile.pm line 241 Cpanel::StateFile::Guard::call_unlocked(Cpanel::StateFile::Guard=HASH(0x208e478), CODE(0x1c4e220)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 643 Cpanel::TaskQueue::process_next_task(Cpanel::TaskQueue=HASH(0x208e538)) called at /usr/local/cpanel/bin/cpbackup_transporter line 168 eval {...} called at /usr/local/cpanel/bin/cpbackup_transporter line 166
    0
  • cPanelLauren
    Hi guys, based on the discussion in the ticket and some notes added to it that you're unable to see, there is a suggestion that this may be related to an internal case CPANEL-32760 - S3 Compatible have a default region of us-east-1, this can fail for some custom regions. I believe this suggestion is correct and I believe it to be related to the issues you're both having most likely. I sent the tech support supervisor a message indicating this and my assumptions and I'm awaiting her response. Until then, I let her know that I'm going to go ahead and mark this thread as related to the internal case. I'll let you know what comes of this when I have more information. In the internal case there is a suggested workaround for this which I don't particularly like but it's as follows: [QUOTE] Known work around(s) Modify the code and disable updates. The /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3.pm file is part of the cpanel-perl-530-Amazon-S3-0.45-1.cp1186.noarch RPM which cannot be excluded without excluding all of cpanel-perl-530.
    0
  • plague
    Hello @cPanelLauren I just want you to know that on saturday (15th) my server was able to send the backups again and the error is gone. The last failure was on August 14th. Nothing was changed on my side, though.
    0

Please sign in to leave a comment.