Apache Infrastructure Team
SVN Service Outage - PostMortem
On Wednesday December 3rd the main US host for the ASF subversion service failed, resulting in loss of service. This loss of subversion service prevented committers from submitting any changes; whilst we have an EU mirror, it is read-only and does not allow any changes to be submitted whilst the master is offline.
The cause of the outage was a failed disk. This failed disk was part of a mirrored OS pair. Some time prior to this the alternate disk had also been replaced due to a failed state.
0401 UTC 2014-10-26 - eris daily run output notes the degraded state of root disk gmirror
1212 UTC 2014-10-30 - INFRA-8551 created to deal with gmirror degradation.
2243 UTC 2014-12-02 - OSUOSL replaced disk in eris
0208 UTC 2014-12-03 - Subversion begins to crawl to a halt
0756 UTC 2014-12-03 - First contractor discovers something awry with subversion service
0834 UTC 2014-12-03 - Infrastructure sends out a notice about the svn issue
0916 UTC 2014-12-03 - Response to issue begins
1010 UTC 2014-12-03 - First complaints about mail being slow/down
1025 UTC 2014-12-03 - Discovery that email queue alerts had been silenced.
1225 UTC 2014-12-03 - Discovery that the eris outage is affecting LDAP-based services including Jenkins and mail
1613 UTC 2014-12-03 - First attempt at power cycling eris
1717 UTC 2014-12-03 - Concern emerges that the 'good' disk in the mirror isn't.
1744 UTC 2014-12-03 - OSUOSL staff shows up in the office
1752 UTC 2014-12-03 - Blog post went up.
1906 UTC 2014-12-03 - New hermes/baldr (hades) being set up for replacement of eris
1911 UTC 2014-12-03 - #svnoutage clean room in hipchat began
2040 UTC 2014-12-03 - machine finally comes up and is usable.
2050 UTC 2014-12-03 - confusion arises over which switch is in which rack. Impedance mismatch between what OSUOSL calls racks and what we call racks.
[Dec-3 5:50 PM] Tony Stevenson: whcih rack is this
[Dec-3 5:50 PM] Tony Stevenson: 1, 2 or 3
[Dec-3 5:50 PM] Justin Dugger (pwnguin): 19
[Dec-3 5:50 PM] David Nalley: what switch?
[Dec-3 5:50 PM] Justin Dugger (pwnguin): HW type: HP ProCurve 2530-48G OEM S/N 1: CN2BFPG1F5
[Dec-3 5:50 PM] David Nalley: ^^^^^^^^^ points to this impedance mismatch for the postmortem
[Dec-3 5:50 PM] David Nalley: no label on the switch?
2054 UTC 2014-12-03 - Data copy begins
0441 UTC 2014-12-04 - data migration finished
1457 UTC 2014-12-04 - SVN starts working again - testing begins
0647 UTC 2014-12-05 - svn-master is operational again with viewvc
- It took us far too long to spin up a replacement machine. This took a few hours because we had to build the host manually from source media and encountered several BIOS/RAID controller issues. Our endeavour to have automated provisioning of tin (bare metal) would certainly have improved this time considerably had it been available at the time of the event.
- Many machines were pointing to eris.a.o for LDAP rather than to a service name (such as ldap1-us-west, for example), which meant we couldn’t easily restore LDAP services for some US hosts without making them think SVN services had moved too.
- Assigning of issues in JIRA - It has perhaps been a long held understanding that if an issue is assigned to someone in JIRA then they are actively managing that issue. This event clearly shows how fragile that belief is.
- DNS (geo) updates were problematic - Daniel will be posting a proposal on Thursday, which will outline our concerns around DNS and a viable way forward that meets our needs and is not reliant on us storing all the data in SVN to be able to effect changes to zones. (This proposal was not triggered by this event; it has been worked on for a number of weeks now.)
- Architectural problems for availability
- We couldn't promote svn-eu to master - data differences/corruption https://issues.apache.org/jira/browse/INFRA-6236
- Current monitoring setup was not sufficient in catching disk errors and correctly alerting infra.
- Daniel to investigate and evaluate multimaster service availability.
- Implement an extended SSL check that not only ensures the service is up, but also checks certificate validity (expiry, revocation status, etc.) and that the certificate chain is valid.
- De-couple DNS from SVN
- De-couple the SVN authz file from SVN directly. Also breser@ has suggested we use the authz validation tool available from the svn install we have on hades, as part of the template->active file generation process.
- Move the ASF status page (http://status.apache.org) outside of our main colos so folks can continue to see it in the event of an outage.
- Vendor provided hardware monitoring tools mandatory on all new hardware deployments.
- Broader audience for incidents and status reports
- More aggressive host replacement before these issues arise
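The extended SSL check listed above could start from something like the following Python sketch. It covers only the expiry and chain/hostname verification parts; revocation checking is not shown. The function names and the 30-day warning threshold are assumptions made for illustration, not part of the team's actual monitoring setup.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Whole days between `now` (epoch seconds) and a certificate's
    notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires - now) // 86400)


def fetch_not_after(host, port=443, timeout=10):
    """Connect with chain and hostname verification enabled and return
    the peer certificate's notAfter field. Raises ssl.SSLError if the
    chain or hostname does not validate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_certificate(host, warn_days=30):
    """Hypothetical check: returns ('OK' or 'WARN', days remaining)."""
    remaining = days_until_expiry(fetch_not_after(host))
    return ("WARN" if remaining < warn_days else "OK", remaining)
```

Keeping the date arithmetic in `days_until_expiry` separate from the network call means the threshold logic can be exercised without contacting a live host.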
Things being considered
- Mandatory use of SNMP for enhanced data gathering.
- Issue ‘nagging’ - develop some thoughts and ideas around the concept of auto-transitioning unmodified JIRA issues after N hours of inactivity and actively nagging the group until an update is made. This, for example, is how Atlassian (and so many others) handle their issues. For example, if an end-user doesn’t update the issue within 5 days, it is automatically closed; if we don’t update an open critical issue within 6 hours, then we get nagged about it.
- Automatically create new JIRA issues (utilising above mentioned auto-transition) to notify of hardware issues (not just relying on hundreds of cron emails a day).
- Again, as part of wider thinking about how we use issue tracking, consider the concept that you only assign an issue to yourself if you are explicitly working on it at that moment, i.e. it should not sit in the queue assigned to someone for > N hours and not receive any updates.
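The nagging idea above can be sketched as a small decision function. This is only an illustration in Python: the thresholds mirror the 6-hour and 5-day examples in the list, while the function name, the `priority` values, and the `waiting_on_reporter` flag are assumptions rather than anything JIRA provides under those names.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, taken from the examples in the text.
NAG_AFTER = timedelta(hours=6)    # critical issue with no update from us
CLOSE_AFTER = timedelta(days=5)   # issue waiting on the end-user


def next_action(priority, waiting_on_reporter, last_updated, now):
    """Decide what an automated job should do with one issue."""
    idle = now - last_updated
    if waiting_on_reporter and idle >= CLOSE_AFTER:
        return "close"
    if not waiting_on_reporter and priority == "critical" and idle >= NAG_AFTER:
        return "nag"
    return "none"
```

A scheduled job could call `next_action` for each open issue and then post a reminder or perform the transition accordingly.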
Things that went well
- The people working on the issue worked extremely well as a team, communicating with one another via hipchat and helping each other along where required. There was a real sense of camaraderie for the first time in a very long time, and this sense of team helped greatly.
- The team put in a bloody hard shift.
- There is now a very solid understanding of the SVN service across at least 4 members of the team, as opposed to 2 x0.5 understandings before.
- A much broader insight into the current design of our infrastructure was gained by the newer members of the team.
Nexus reduced performance issues resolved.
So Tuesday morning we got a report in IRC that a committer was trying to get a release out
and could not deploy. Shortly after a Nexus issue was reported in Jira INFRA-8321. A few
hours later another issue INFRA-8322 related to Nexus was opened. So far, nothing unusual
Yesterday, more issues were reported on IRC/HipChat, and more issues were opened
(INFRA-8326, INFRA-8327, INFRA-8328, INFRA-8334). By then it was obvious this was more than
a coincidence and it was already being looked into.
Twitter notifications and emails were sent out declaring the degraded performance an outage
and On Call was full time looking into the issue. Others joined the call to assist and eventually
the cause of the outage was determined to be a change to LDAP configuration made 2 days ago by Infra.
(See infra:r921805 for the revert of that.)
The LDAP change was made to improve response times as it was being reported as being slow
to return queries. Reverting the change cured the issues Nexus was having contacting the
groups that committers belonged to.
There will be another avenue looked into for improving LDAP query response times whilst not
affecting those services that connect via anon bind.
Infra thanks everyone for their patience whilst this was looked into and resolved.
Thanks go to those involved in working towards the solution:-
Gavin McDonald (gmcdonald)
Tony Stevenson (pctony)
Chris Lambertus (cml)
Daniel Gruno (humbedooh)
Brian Fox (brianf)
Posted at 09:19AM Sep 11, 2014 by administrator in Status
New status page for the ASF
We are pleased to announce that we have a new status page for our infrastructure and the ASF as a whole.
Where we have previously been focused on reporting the up/down status of our services, we have now begun to look a bit more at the broader picture of the ASF: what's going on, who is committing how much, where emails are going, what's happening on GitHub mirrors and so on, as well as tracking uptime and availability for our public services that power the ASF's online presence.
The result of this broader scope can be seen on: http://status.apache.org
It is our hope that you'll find this new status page informative and helpful, both in times of trouble and times where everything is in working condition.
Posted at 01:45PM Aug 14, 2014 by humbedooh in Status
New monitoring system: Nagios is dead, long live Circonus
On 23 May 2014 the old monitoring system, Nagios, was put to sleep, and Circonus was given production status.
The new monitoring system is sponsored by circonus and most of the monitoring as well as the central database runs on www.circonus.com. The infrastructure team have built and deployed logic around the standard circonus system:
- A private broker, to monitor internal services without exposing them on the internet
- A dedicated broker (in-house development) that monitors special ASF systems (like svn compare US - EU)
- A configuration system that is based on svn.
- A new status page status.apache.org
- A new team structure (all committers with sudo karma on a vm, get an email when something happens with the vm)
The new system is a lot faster, and we can therefore offer projects monitoring of project URLs. Of course, the project also needs to have a team that handles the alerts.
The current version has approximately the same facilities as Nagios, but we are planning (and actively programming) a version 2 that will allow us to better predict problems before they occur.
Some of the upcoming features are:
- disk monitoring
- vital statistics from core systems (like the size of mail queues)
The change of monitoring system is a vital component in our transition to automate services and thereby enable infra to more effectively secure the stability of the infrastructure as well as make early detection of potential problems.
The system was presented at ApacheCon Denver 2014; slides can be found here. We hope to present the live version at ApacheCon Budapest 2014.
On behalf of the infrastructure team
Posted at 10:29PM May 23, 2014 by jani in Status
During the afternoon of May 6th we began experiencing delays in mail delivery of 1-2 hours. Initial efforts at remediation seemed to clear this up, but on the morning of May 7th the problem worsened and we proactively disabled mail service to deal with the failure. The underlying hardware suffered failures on multiple disks. This outage affects all ASF mailing lists and mail forwarding.
This service is housed at OSUOSL, and we are currently waiting on smart hands to help with replacing hardware. Our expectation at this point is that we still have multiple hours worth of outage.
Incoming mail is currently being received and held in queue by our mail exchangers. We also have a copy of the existing queue that hasn't been processed; so we expect no mail or data loss.
ASF Infra's twitter bot will provide updates as we have them for the duration of the outage. Feel free to follow @infrabot on Twitter. There will be an update on this post as well as the situation progresses.
UPDATE 7 May 19:27 UTC - Drives have been replaced, array is attempting to rebuild. As indicated earlier on twitter, there likely remains hours of outage.
UPDATE 7 May 20:44 UTC - The disk array is still in the process of repairing. Several hundred mails were processed during a reboot, but more work remains before service is restored. Mail service has been disabled again as the array repair process is CPU-bound. The plan going forward is to allow the disk arrays to finish repairs. Once that is complete, we'll reenable the mail service and flush what is currently in the queue. Finally, once the queue is empty we'll begin receiving mail again.
UPDATE 8 May 05:00 UTC - The disk array failed to repair itself. The disks have been replaced and a new installation has been completed. Progress continues to be made towards resolution, but nothing firm enough yet for us to predict a time for restoration.
UPDATE 8 May 15:45 UTC - No material change of status has occurred. Infra worked in shifts around the clock last night and continue to do so to restore service. More updates as they become available.
UPDATE 9 May 11:20 UTC - We are working on temporarily restoring the most essential email aliases. In the meantime, inquiries may be made to email@example.com or on our IRC channel, #asfinfra on Freenode. The work on restoring the service in full is still ongoing.
UPDATE 9 May 17:20 UTC - We've successfully restored a host from backups and will be starting testing soon. Based on the progress made in those tests we'll try and provide expectations around restoration of service timeline.
UPDATE 10 May 15:45 UTC - We've started pushing live mails through the system - you'll begin to see them trickle in as we gradually open the floodgates to restore service. Expect intermittent spurts for a while.
UPDATE 10 May 21:55 UTC - The floodgates have been opened. As we have a significant amount of backlog to catch up on, please be patient as the service does this. As always feel free to contact us if you have any questions. In the immediate short term (next day or so), we suggest you continue to use firstname.lastname@example.org and our IRC channel, #asfinfra on Freenode. We would like to thank you for your patience during this extremely busy time.
UPDATE 12 May 16:04 UTC - Clarification - we have opened the floodgates, but have a substantial amount of backlog, and with the sudden rush of mail we are being throttled by various mail services. With the addition of mail that's coming through anyway, it may take us from 2-5 days to fully flush the backlog. This estimate is so wide because of a variety of factors that are largely outside of our control, such as new mail coming in and mail services' individual throttling policies.
New Infra Team Members
Since our last update over a year ago, the Infra Team has expanded by another NINE (9) members!
Congrats and our warmest thanks go to:
Niklas Gustavsson - (ngn)
Jeremy Thomerson - (jrthomerson)
Mark Struberg - (struberg)
Eric Evans - (eevans)
Brandon Williams - (brandonwilliams)
Mohammad Nour El-Din - (mnour)
David Nalley - (ke4qqq)
Yang Shih-Ching - (imacat)
Daniel Gruno - (humbedooh)
The rest of the Infra team look forward to continuing to work with you all.
There are now a total of 80 infrastructure members with another 36 in the infrastructure-interest group.
1 million commits and still going strong.
Yesterday, the main ASF SVN code repository passed the 1 million commit mark. Shortly thereafter one of the ASF members enquired as to how he could best grab the svn log entries for all of these commits. As always there were a bunch of useful replies, but they were all set to take quite some time; mainly because if anyone just simply runs
svn log http://svn.apache.org/repos/asf -r1:1000000
It will not only take several hours, it will also cause high levels of load on one of the two geo-balanced SVN servers. Also, requesting that many log entries will likely result in your IP address being banned.
So I decided to create the log set locally on one of the SVN servers. This is now available for download [http://s.apache.org/1m-svnlog] [md5]
This is a 50MB tar/gz file. It will uncompress to about 240MB. The log 'only' contains the log entries from 1 -> 1,000,000 - if you want the rest you can run:
svn log http://svn.apache.org/repos/asf -r1000001:HEAD
This will give you all the log entries from 1M+1 to current
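Since an md5 is published alongside the download, it is worth checking the tarball before unpacking it. The following Python sketch shows one way to do that; the function names and the local file name in the usage note are hypothetical, and the published checksum has to be copied from the md5 link above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks so a
    large download does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_hex):
    """Compare a local file against a published MD5 checksum string."""
    return md5_of_file(path) == published_hex.strip().lower()
```

For example, `matches_published_md5("1m-svnlog.tar.gz", published)` would return True only if the local copy (here assumed to be saved under that name) matches the published value.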
Posted at 11:55AM Sep 23, 2010 by pctony in Status
LDAP, groups and SVN - Coupled together
The infrastructure team have now completed the next stage of the planned LDAP migration.
We have migrated our old SVN authorisation file, and POSIX groups into LDAP data. SVN access control is now managed using these groups.
This means that changing access to the Subversion repositories is now as simple as changing group membership. We use some custom perl scripts that build the equivalent authorisation file, meaning that we don't need to use the nasty <location> blocks hack to do this. It also means that all changes, including adding new groups and extending access control, are made simple.
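To illustrate the idea of building the authorisation file from group data, here is a minimal sketch. It is written in Python rather than Perl, takes the group membership from an in-memory dictionary instead of querying LDAP, and uses made-up group names and paths, so it should be read as an illustration of the approach and not as the team's actual scripts.

```python
def build_authz(groups, rules):
    """Render a Subversion authz file.

    groups: {group_name: [user, ...]}
    rules:  {repo_path: {group_name: access}}, access being 'r' or 'rw'
    """
    lines = ["[groups]"]
    for name in sorted(groups):
        members = ",".join(sorted(groups[name]))
        lines.append("{} = {}".format(name, members))
    for path in sorted(rules):
        lines.append("")
        lines.append("[{}]".format(path))
        for name in sorted(rules[path]):
            # '@' marks a group reference in the authz format
            lines.append("@{} = {}".format(name, rules[path][name]))
    return "\n".join(lines) + "\n"
```

With this shape, a change in group membership only changes the input data; the file is regenerated rather than edited by hand.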
ASF PMC chairs are now able to make changes to their POSIX and SVN groups whilst logged into people.apache.org, using a selection of scripts:
All of these scripts have a '--help' option to show you how to use them.
What's next? We are now working on adding a custom ASF LDAP schema, that will allow us to record ASF specific data such as ICLA files and date of membership etc.
We will also be looking at adding support for 3rd party applications such as Hudson, and building an identity management portal where people can manage their own account.
Posted at 10:03PM Feb 22, 2010 by pctony in Status
apache.org incident report for 8/28/2009
Last week we posted about the security breach that caused us to temporarily suspend some services. All services
have now been restored. We have analyzed the events that led to the breach, and continued to work on improving the security of our systems.
NOTE: At no time were any Apache Software Foundation code repositories, downloads, or users put at risk by this intrusion. However, we believe that providing a detailed account of what happened will make the internet a better place, by allowing others to learn from our mistakes.
Our initial running theory was correct--the server that hosted the apachecon.com (dv35.apachecon.com) website had been compromised. The machine was running CentOS, and we suspect they may have used the recent local root exploits patched in RHSA-2009-1222 to escalate their privileges on this machine. The attackers fully compromised this machine, including gaining root privileges, and destroyed most of the logs, making it difficult for us to confirm the details of everything that happened on the machine.
This machine is owned by the ApacheCon conference production company, not by the Apache Software Foundation. However, members of the ASF infrastructure team had accounts on this machine, including one used to create backups.
The attackers attempted unsuccessfully to use passwords from the compromised ApacheCon host to log on to our production webservers. Later, using the SSH Key of the backup account, they were able to access people.apache.org (minotaur.apache.org). This account was an unprivileged user, used to create backups from the ApacheCon host.
minotaur.apache.org runs FreeBSD 7-STABLE, and acts as the staging machine for our mirror network. It is our primary shell account server, and provides many other services for Apache developers. None of our Subversion (version control) data is kept on this machine, and there was never any risk to any Apache source code.
Once the attackers had gained shell access, they added CGI scripts to the document root folders of several of our websites. A regular, scheduled rsync process copied these scripts to our production web server, eos.apache.org, where they became externally visible. The CGI scripts were used to obtain remote shells, with information sent using HTTP POST commands.
Our download pages are dynamically generated, to enable us to present users with a local mirror of our software. This means that all of our domains have ExecCGI enabled, making it harder for us to protect against an attack of this nature.
After discovering the CGI scripts, the infrastructure team decided to shut down any servers that could potentially have been affected. This included people.apache.org, and both the EU and US website servers. All website traffic was redirected to a known-good server, and a temporary security message was put in place to let people know we were aware of an issue.
One by one, we brought the potentially-affected servers up, in single user mode, using our out of band access. It quickly became clear that aurora.apache.org, the EU website server, had not been affected. Although the CGI scripts had been rsync'd to that machine, they had never been run. This machine was not included in the DNS rotation at the time of the attack.
aurora.apache.org runs Solaris 10, and we were able to restore the box to a known-good configuration by cloning and promoting a ZFS snapshot from a day before the CGI scripts were synced over. Doing so enabled us to bring the EU server back online, and to rapidly restore our main websites. Thereafter, we continued to analyze the cause of the breach, the method of access, and which, if any, other machines had been compromised.
Shortly after bringing up aurora.apache.org we determined that the most likely route of the breach was the backup routine from dv35.apachecon.com. We grabbed all the available logs from dv35.apachecon.com, and promptly shut it down.
Analysis continued on minotaur.apache.org and eos.apache.org (our US server), until we were confident that all remnants of the attackers had been removed. As each server was declared clean, it was brought back online.
What worked?
- The use of ZFS snapshots enabled us to restore the EU production web server to a known-good state.
- Redundant services in two locations allowed us to run services from an alternate location while continuing to work on the affected servers and services.
- A non-uniform set of compromised machines (Linux/CentOS i386, FreeBSD-7 amd64, and Solaris 10 on sparc) made it difficult for the attackers to escalate privileges on multiple machines.
What didn't work?
- The use of SSH keys facilitated this attack. In hindsight, our implementation left a lot to be desired--we did not restrict SSH keys appropriately, and we were unaware of their misuse.
- The rsync setup, which uses people.apache.org to manage the deployment of our websites, enabled the attackers to get their files onto the US mirror, undetected.
- The ability to run CGI scripts in any virtual host, when most of our websites do not need this functionality, made us unnecessarily vulnerable to an attack of this nature.
- The lack of logs from the ApacheCon host prevents us from conclusively determining the full course of action taken by the attacker. All but one log file were deleted by the attacker, and logs were not kept off the machine.
What changes are we making now?
As a result of this intrusion we are making several changes, to help further secure our infrastructure from such issues in the future. These changes include the following:
- Requiring all users with elevated privileges to use OPIE for sudo on certain machines. We already require this in some places, but will expand its use as necessary.
- Using new SSH keys, one per host, for backups. Also enforcing use of the from="" and command="" strings in the authorized keys file on the destination backup server. In tandem with access restrictions which only allow connections from machines that are actually backing up data, this will prevent 3rd party machines from being able to establish an SSH connection.
- The command="" string in the authorized_keys file is now explicit, and only allows one way rsync traffic, due to the paths and flags used.
- New keys have been generated for all hosts, with a minimum key length of 4096 bits.
- The VM that hosted the old apachecon.com site remains powered down, awaiting further detailed analysis. The apachecon.com website has been re-deployed on a new VM, with a new provider and different operating system.
- We are looking at disabling CGI support on most of our website systems. This has led to the creation of a new httpd module that will handle things like mirror locations for downloads.
- The method by which most of our public facing websites are deployed to our production servers will also change, becoming a much more automated process. We hope to have switched over to a SvnSubPub / SvnWcSub based system within the next few weeks.
- We will re-implement measures such as IP banning after several failed logins, on all machines.
- A proposal has been made to introduce centralized logging. This would include all system logs, and possibly also services such as smtpd and httpd.
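The from="" and command="" restrictions mentioned in the list above are options placed in front of a public key in an authorized_keys file. The Python sketch below shows how such an entry could be assembled; the host name, wrapper path, and key material in the usage note are placeholders, and the extra no-* options are an assumption about sensible hardening, not a record of what was actually deployed.

```python
def authorized_keys_entry(source_host, forced_command, public_key):
    """Build one authorized_keys line that only accepts connections from
    `source_host` and always runs `forced_command`, whatever the client
    asks for."""
    options = [
        'from="{}"'.format(source_host),
        # Escape double quotes so the command stays inside its option string
        'command="{}"'.format(forced_command.replace('"', '\\"')),
        "no-pty",
        "no-port-forwarding",
        "no-agent-forwarding",
        "no-X11-forwarding",
    ]
    return "{} {}".format(",".join(options), public_key)
```

For example, `authorized_keys_entry("backup.example.org", "/usr/local/bin/rsync-wrapper", "ssh-rsa AAAA... backup@example")` yields a single line that can be appended to the backup account's authorized_keys file. With one key per host, a stolen key can then only be used from that host and only for the one forced command.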
Confluence 2.10 migration for cwiki.a.o 11 July
The ASF Infrastructure Team will be upgrading the Confluence instance powering http://cwiki.apache.org from Confluence 2.2.9 to Confluence 2.10.3 on July 11 at 0400 UTC, or July 10 at 2100 PDT. The migration is expected to take several hours.
If you haven't already, this would be a good time to check the test migration instance at:
Exported pages can be found at http://confluence-test.zones.apache.org:8080/export/SPACE_KEY/PAGE_TITLE.html If in doubt, find your existing exported pages at http://cwiki.apache.org/, so:
As much as possible, the space export templates will be preserved in the migration, although changes to the Confluence UI will mean the exports will look different.
Further updates with regard to the Confluence 2.10.3 migration will be posted to this blog.
The Confluence 2.10.3 upgrade has been completed and all spaces have been exported. There are a few things to note:
- The Gliffy license is out of date. I'll try to track down a new one.
- The visibility plugin doesn't support Confluence 2.10.3. Not sure if anyone uses it, however.
- The exported html, as warned, generally looks a bit different. Let me know if you have any issues tweaking your template.
Update 11-07-2009 part 2
If, for some reason, your templates didn't get copied over or the exported site is so messed up you need the old version, the old files are available:
- Autoexport templates - http://cwiki.apache.org/autoexport-2.2.9-templates
- Autoexport-generated html - http://cwiki.apache.org/autoexport-2.2.9
Update 14-07-2009
The Gliffy folks were kind enough to give us a new license. Please re-export any applicable spaces.
Posted at 07:04AM Jul 07, 2009 by mrdon in Status
Slow SVN Service This Week
In preparation for upgrading Subversion to the latest version (1.6.0), we are running an svn dump on svn.apache.org. This will chew up enough disk IO to be noticeable to svn users. We expect the dump to finish sometime during this weekend.
Posted at 06:39PM Mar 25, 2009 by joes in Status