Apache Infrastructure Team

Wednesday May 07, 2014

Mail outage

During the afternoon of May 6th we began experiencing delays in mail delivery of 1-2 hours. Initial efforts at remediation seemed to clear this up, but on the morning of May 7th the problem worsened and we proactively disabled mail service to deal with the failure. The underlying hardware suffered failures on multiple disks. This outage affects all ASF mailing lists and mail forwarding.

This service is housed at OSUOSL, and we are currently waiting on smart hands to help with replacing hardware. Our expectation at this point is that we still have multiple hours' worth of outage.

Incoming mail is currently being received and held in queue by our mail exchangers. We also have a copy of the existing queue that hasn't been processed, so we expect no mail or data loss.

ASF Infra's Twitter bot will provide updates as we have them for the duration of the outage. Feel free to follow @infrabot on Twitter. This post will also be updated as the situation progresses.

UPDATE 7 May 19:27 UTC - Drives have been replaced, and the array is attempting to rebuild. As indicated earlier on Twitter, there likely remain hours of outage.

UPDATE 7 May 20:44 UTC - The disk array is still in the process of repairing. Several hundred mails were processed during a reboot, but more work remains before service is restored.  Mail service has been disabled again as the array repair process is CPU-bound. The plan going forward is to allow the disk arrays to finish repairs. Once that is complete, we'll reenable the mail service and flush what is currently in the queue. Finally, once the queue is empty we'll begin receiving mail again.

UPDATE 8 May 05:00 UTC - The disk array failed to repair itself. The disks have been replaced and a new installation has been completed. Progress continues to be made towards resolution, but nothing firm enough yet for us to predict a time for restoration.

UPDATE 8 May 15:45 UTC - No material change of status has occurred. Infra worked in shifts around the clock last night and continues to do so to restore service. More updates as they become available.

UPDATE 9 May 11:20 UTC - We are working on temporarily restoring the most essential email aliases. In the meantime, inquiries may be made to infrastructure@apache.pw or on our IRC channel, #asfinfra on Freenode. The work on restoring the service in full is still ongoing.

UPDATE 9 May 17:20 UTC - We've successfully restored a host from backups and will be starting testing soon. Based on the progress made in those tests, we'll try to provide expectations for the service restoration timeline.

UPDATE 10 May 15:45 UTC - We've started pushing live mails through the system - you'll begin to see them trickle in as we gradually open the floodgates to restore service. Expect intermittent spurts for a while. 

UPDATE 10 May 21:55 UTC - The floodgates have been opened. As we have a significant amount of backlog to catch up on, please be patient as the service does this. As always, feel free to contact us if you have any questions. In the immediate short term (next day or so), we suggest you continue to use infrastructure@apache.pw and our IRC channel, #asfinfra on Freenode. We would like to thank you for your patience during this extremely busy time.

UPDATE 12 May 16:04 UTC - Clarification - we have opened the floodgates, but have a substantial amount of backlog, and with the sudden rush of mail we are being throttled by various mail services. With the addition of mail that's coming through anyway, it may take us 2-5 days to fully flush the backlog. This range is so wide because of a variety of factors that are largely outside of our control, such as new mail coming in and mail services' individual throttling policies.

Comments:

What is the point of buying / renting a disk array if it can't handle what it is really built for, i.e. "repair itself" after a failure?

Posted by Tilman on May 08, 2014 at 08:02 PM UTC #

Thanks for all your work on this. Hang in there ASF Infra!

Posted by Sean Busbey on May 08, 2014 at 08:12 PM UTC #

Thank you all for working hard to resolve this critical issue. Any more updates?

Posted by Gary D. Gregory on May 09, 2014 at 01:13 AM UTC #

@Tilman: Most disk arrays can't recover from multiple disk failures at the same time, or at least more than 2. I don't have the details, but it sounds like there were probably more than 2. One wonders if alerts were not addressed in a timely manner.

Posted by Jeff Janner on May 09, 2014 at 05:47 PM UTC #

For the record, it seems that it also affects the forums! See: https://forum.openoffice.org/en/forum/viewtopic.php?f=5&t=69640

Posted by Hagar Delest on May 09, 2014 at 09:10 PM UTC #

Will it be possible to recover the queued messages, or is the array a total loss? Stated another way: Are we going to get a major flood of email when the service resumes?

Posted by Shawn Heisey on May 10, 2014 at 02:02 AM UTC #

Apparently I'm completely blind, because the mail loss thing was actually mentioned. Thanks for covering all bases. People might get the impression that you've been doing this a while!

Posted by Shawn Heisey on May 10, 2014 at 04:09 AM UTC #

@ASF infra team: Thanks for all your work. @Jeff: Indeed, several disks going belly up at the very same time sounds unlikely. I hope that after the smoke has cleared, they will find out whether either the disk array was of the type "rushed to production, tests later" or whether alarm signs were disregarded by some external contractors. Disk arrays are bought to prevent just the thing that has happened now. Luckily, JIRA and SVN still work :-)

Posted by Tilman on May 10, 2014 at 04:44 PM UTC #

