Apache Infrastructure Team
SVN Service Outage - PostMortem
On Wednesday December 3rd the main US host for the ASF subversion service fails resulting in loss of service. This loss of subversion service prevent committers from submitting any changes, and whilst we have an EU mirror it is read-only and does not allow for any changes to be submitted whilst the master is offline.
The cause of the outage was a failed disk. This failed disk was part of a mirrored OS pair. Some time prior to this the alternate disk had also been replaced due to a failed state.
0401 UTC 2014-10-26 - eris daily run output notes the degraded state of root disk gmirror
1212 UTC 2014-10-30 - INFRA-8551 created to deal with gmirror degradation.
2243 UTC 2014-12-02 - OSUOSL replaced disk in eris
0208 UTC 2013-12-03 - Subversion begins to crawl to a halt
0756 UTC 2013-12-03 - First contractor discovers something awry with subversion service
0834 UTC 2013-12-03 - Infrastructure sends out a notice about the svn issue
0916 UTC 2013-12-03 - Response to issue begins
1010 UTC 2013-12-03 - First complaints about mail being slow/down
1025 UTC 2013-12-03 - Discovery that email queue alerts had been silenced.
1225 UTC 2013-12-03 - Discovery that Eris outage affecting LDAP-based services including Jenkins and mail
1613 UTC 2013-12-03 - First attempt at power cycling eris
1717 UTC 2013-12-03 - Concern emerges that the 'good' disk in the mirror isn't.
1744 UTC 2013-12-03 - OSUOSL staff shows up in the office
1752 UTC 2013-12-03 - Blog post went up.
1906 UTC 2014-12-03 - New hermes/baldr (hades) being set up for replacement of eris
1911 UTC 2014-12-03 - #svnoutage clean room in hipchat began
2040 UTC 2014-12-03 - machine finally comes up and is usable.
2050 UTC 2014-12-03 - confusion arises between which switch is in which rack. Impedance mismatch between what OSUOSL calls racks, and what we called racks.
[Dec-3 5:50 PM] Tony Stevenson: whcih rack is this
[Dec-3 5:50 PM] Tony Stevenson: 1, 2 or 3
[Dec-3 5:50 PM] Justin Dugger (pwnguin): 19
[Dec-3 5:50 PM] David Nalley: what switch?
[Dec-3 5:50 PM] Justin Dugger (pwnguin): HW type: HP ProCurve 2530-48G OEM S/N 1: CN2BFPG1F5
[Dec-3 5:50 PM] David Nalley: ^^^^^^^^^ points to this impedance mismatch for the postmortem
[Dec-3 5:50 PM] David Nalley: no label on the switch?
2054 UTC 2014-12-03 - Data copy begins
0441 UTC 2014-12-04 - data migration finished
1457 UTC 2014-12-04 - SVN starts working again - testing begins
0647 UTC 2014-12-05 - svn-master is operational again with viewvc
- It took us far too long to spin up replacement machine. This in fact took a few hours due to having to manually build the host from source media and encountering several BIOS/RaidController issues. Our endeavour to have automated provisioning of tin (bare metal) would certainly have improved this time considerably had it been available at the time of the event.
- Many machines pointing to eris.a.o for LDAP - not to a service name (such as ldap1-us-west for example) which meant we couldn’t easily restore LDAP services for some US hosts without making them also think SVN services had also moved.
- Assigning of issues in JIRA - It has perhaps been a long held understanding that if an issue is assigned to someone in JIRA then they are actively managing that issue. This event clearly shows how fragile that belief is.
- DNS (geo) updates were problematic - Daniel will be posting a proposal on Thursday, which will outline our concerns around DNS and a viable way forward that meets our needs and is not reliant on us storing all the data in SVN to be able to effect changes to zones. (This proposal was not created as a tiger of this event, it has been worked on for a number of weeks now).
- architectural problems for availablility
- We couldn't promote svn-eu to master - data differences/corruption https://issues.apache.org/jira/browse/INFRA-6236
- Current monitoring setup was not sufficient in catching disk errors and correctly alerting infra.
- Daniel to investigate and evaluate multimaster service availability.
- Implement an extended SSL check that not only ensures the service is up, but also checks cert validity (expire, revocation status etc), and the certificate chain is valid.
- De-couple DNS from SVN
- De-couple the SVN authz file from SVN directly. Also breser@ has suggested we use the authz validation tool available from the svn install we have on hades, as part of the template->active file generation process.
- Move the ASF status page (http://status.apache.org) outside of our main colos so folks can continue to see it in the event of an outage.
- Vendor provided hardware monitoring tools mandatory on all new hardware deployments.
- Broader audience for incidents and status reports
- More aggressive host replacement before these issues arise
Things being considered
- Mandatory use of SNMP for enhanced data gathering.
- Issue ‘nagging’ - develop some thoughts and ideas around the concept of auto-transitioning un-modified JIRA issues after N hours of in activity and actively nag the group until an update is made. This for example is how Atlasssian (and so many others) handle their issues. For example if an end-user doesn’t update the issue within 5 days, it is automatically closed, if we don’t update an open issue within 6 hours for a critical issue then we get nagged about it.
- Automatically create new JIRA issues (utilising above mentioned auto-transition) to notify of hardware issues (not just relying on hundreds of cron emails a day).
- Again as part of a wider thinking of how we use issue tracking consider the concept that you only assign an issue to yourself if you are explicitly working on it at that moment, i.e it should not sit in the queue assigned to someone for > N hours and not receive any updates.
Things that went well
- The people working on the issue worked extremely well as a team. Communicating with one another via hipchat and helping each other along where required. There was a real sense of camaraderie for the first time in a very long time and this see of team helped greatly.
- The team put in a bloody hard shift.
- There is now a very solid understanding of the SVN service across at least 4 members of the team, as opposed to 2 x0.5 understandings before.
- A much broader insight into the current design of our infrastructure was gained by the newer members of the team.
MoinMoin Service - User Account Tidy Up
In recent months we have become increasingly aware of a slowing down of our MoinMoin wiki service. We have attributed this, at least in part, due to the way MoinMoin stores some data about user accounts.
Across all of our wiki instances (in the farm) we had a little over 1.08 million distinct user accounts. Many of which have never been used (spam etc). So we have decided to archive all users who have not accessed any of the wiki sites they were registered for in more than 128 days.
This has resulted in us being able to archive a little over 800k users. This leaves us with around 200k users across 77 wikis. This still feels very high, and in the coming weeks we will investigate further still in how we can better understand if those remaining accounts are making valid changes, or are they just link farm home pages.
If you think your account was affected by this, and you would like to have your account restored, then please contact the Infra team using this page http://www.apache.org/dev/infra-contact
ASF Infra Team
Posted at 12:17PM Nov 21, 2014 by pctony in General | |
Code signing service now available
The ASF Infrastructure team is pleased to announce the availability of a new code signing service for Java, Windows and Android applications. This service is available to any Apache project to use to sign their releases. Traditionally, Apache projects have shipped source code. The code tarballs are signed with a GPG signature to allow users and providers to verify the code's authenticity, but users have either compiled their own applications or some projects have provided convenience binaries. With projects like Apache OpenOffice, users expect to receive binaries that are ready to run. Today's desktop and mobile operating systems expect that binaries will be signed by the vendor -- which had left a gap to be filled for Apache projects.
After a great deal of research, we have chosen Symantec's Secure App Service offering to provide code signing service. This allows us to granularly permit access; and each PMC will have their own certificate(s) for signing. The per-project nature of certificate issuance allows us to revoke a signature without disrupting other projects.
This service will permit projects to sign artifacts either via a web GUI or a SOAP API. In addition a Java client and an ant task for signing have been written and a maven plugin is under development.
This service results in a 'pay for what you use' scenario, so PMCs are asked to use the service responsibly. To that end, projects will have access to a test environment to ensure that they have their process working correctly before consuming actual credits.
Thus far, we've had two projects who have helped testing this and working out process for which we are very grateful. Those projects, Commons and Tomcat, have successfully released signed artifacts recently. (Commons Daemon 1.0.15 and Tomcat 8.0.14)
Posted at 04:36PM Oct 06, 2014 by markt in General | |
GitHub pull request builds now available on builds.apache.org
The ASF Infrastructure team is happy to announce that you can now set up jobs on builds.apache.org to listen for pull requests to github.com/apache repositories, build that pull request’s changes, and then comment on the pull request with the build’s results. This is done using the Jenkins Enterprise GitHub pull request builder plugin, generously provided to the ASF by our friends at CloudBees. We've set up the necessary hooks on all github.com/apache repositories that are up as of Wednesday, Oct 1, 2014, and will be adding the hooks to all new repositories going forward.
Here’s what you need to do to set it up:
- Create a new job, probably copied from an existing job.
- Make sure you’re not doing any “mvn deploy” or equivalent in the new job - this job shouldn’t be deploying any artifacts to Nexus, etc.
- Check the "Enable Git validated merge support” box - you can leave the first few fields set to their default, since we’re not actually pushing anything. This is just required to get the pull request builder to register correctly.
- Set the “GitHub project” field to the HTTP URL for your repository - i.e.,"http://github.com/apache/incubator-brooklyn/"- make sure it ends with that trailing slash and doesn’t include .git, etc.
- In the Git SCM section of the job configuration, set the repository URL to point to the GitHub git:// URL for your repository - i.e., git://github.com/apache/incubator-brooklyn.git.
- You should be able to leave the “Branches to build” field as is - this won’t be relevant anyway.
- Click the “Add” button in “Additional Behaviors” and choose "Strategy for choosing what to build”. Make sure the choosing strategy is set to “Build commits submitted for validated merge”.
- Uncheck any existing build triggers - this shouldn’t be running on a schedule, polling, running when SNAPSHOT dependencies are built, etc.
- Check the “Build pull requests to the repository” option in the build triggers.
- Optionally change anything else in the job that you’d like to be different for a pull request build than for a normal build - i.e., any downstream build triggers should probably be removed, you may want to change email recipients, etc.
- Save, and you’re done!
Now when a pull request is opened or new changes are pushed to an existing pull request to your repository, this job will be triggered, and it will build the pull request. A link will be added to the pull request in the list of builds for the job, and when the build completes, Jenkins will comment on the pull request with the build result and a link to the build at builds.apache.org.
In addition, you can also use the "Build when a change is pushed to GitHub" option in the build triggers for non-pull request jobs, instead of polling - Jenkins receives notifications from GitHub whenever one of our repositories has been pushed to. Jenkins can then determine which jobs use that repository and the branch that was pushed to, and trigger the appropriate build.
If you have any questions or problems, please email firstname.lastname@example.org or open a BUILDS JIRA at issues.apache.org.
Posted at 01:00PM Oct 02, 2014 by abayer in General | |
Committer shell access to people.apache.org
Apache committers are granted shell access to a host know as either people.apache.org or minotaur. As you may know, there has been a two year grace period in which we have advertised the upcoming change away from password logins to SSH key only.
Due to a recent significant increase in security issues, the Infrastructure team has taken steps to complete the implementation of key-only logins to protect ASF computing resources.
If you can't access the host anymore then it is very likely you do not have your key stored in LDAP. Please check your LDAP data in https://id.apache.org - and add your key(s) if they are not present. If neccessary, ensure your keys are loaded locally (for linux see http://linux.die.net/man/1/ssh-add and http://linux.die.net/man/1/ssh-agent)
The host will pick up this change within 5 minutes of you making your change and you should be able to get in again.
As always if you have any issues please open a JIRA issue in the INFRA project and we will help you as soon as we can.
Posted at 11:38PM Sep 25, 2014 by pctony in Development | |
Committers mail relay service
For a very long time now we have allowed committers to send email from their @apache.org email address from any host. 10 years ago this was less of an issue than it is today. In the current world of mass spam and junk flying around, mail server providers are trying to find better ways to implement a sense of safety from this for their users. One such method is SPF . These methodologies check that incoming email actually originated via a valid mail server for the senders domain.
For example if you send from email@example.com, but you just send that via your ISP at home, it could be construed as being junk as it never came via an apache.org mail server. Some time ago we setup a service on people.apache.org to cater for this, but it was never enforced and it seems that the SMTP daemon running the service is not 100% RFC compliant and thus some people have been unable to use this service.
As of today, we have stood up a new service on host mail-relay.apache.org that will allow committers to send their apache.org emails via a daemon that is RFC compliant and uses your LDAP credentials. You can read here  what settings you will need to be able to use this service.
On Friday October 10th, at 13:00 UTC the old service on people.apache.org will be terminated, and the updates to the DNS to enforce sending of all apache.org email to have originated via an ASF mail server will be enabled. This means that as of this time if you do not send your apache.org email via mail-relay it is very likely that the mail will not reach it's destination.
When we say 'send your apache.org email' - we mean that when you send *from* your firstname.lastname@example.org email. Emails sent *to* any apache.org email address will not affected by this.
Nexus reduced performance issues resolved.
So Tuesday morning we got a report in IRC that a committer was trying to get a release out
and could not deploy. Shortly after a Nexus issue was reported in Jira INFRA-8321. A few
hours later another issue INFRA-8322 related to Nexus was opened. So far, nothing unusual
Yesterday, more issues reported on IRC/HipChat, and more issues opened.
(INFRA-8326,INFRA-8327,INFRA-8328, INFRA-8334). By then it was obvious this more than
a coincidence and it was already being looked into.
Twitter notifications and emails were sent out declaring the degraded performance an outage
and On Call was full time looking into the issue. Others joined the call to assist and eventually
the outage was determined to be a change to LDAP configuration made 2 days ago by Infra.
(See infra:r921805 for the revert of that.)
The LDAP change was made to improve response times as it was being reported as being slow
to return queries. Reverting the change cured the issues Nexus was having contacting the
groups that committers belonged to.
There will be another avenue looked into for improving LDAP query response times whilst not
affecting those services that connect via anon bind.
Infra thanks everyone for their patience whilst this was looked into and resolved.
Thanks go to those involved in working towards the solution:-
Gavin McDonald (gmcdonald)
Tony Stevenson (pctony)
Chris Lambertus (cml)
Daniel Gruno (humbedooh)
Brian Fox (brianf)
Posted at 09:19AM Sep 11, 2014 by administrator in Status | |
On-demand slaves from Rackspace added to builds.apache.org
Posted at 01:00PM Sep 04, 2014 by abayer in General | |
Infrastructure Team Adopting an On-Call Rotation
As the Apache Software Foundation (ASF) has grown, the infrastructure required to support its diverse set of projects has grown as well. To care for the infrastructure that the ASF depends on, the foundation has hired several contractors to supplement the dedicated cadre of volunteers who help maintain the ASFs hardware and services. To best utilize the time of our paid contractors and volunteers, the Infrastructure team will be adopting an on-call rotation to meet requests and resolve outages in a timely fashion.
Why We're Establishing an On-Call Rotation
In groups, especially groups that are charged with overlapping duties, there's occasionally a sense of diffusion of responsibility. There tends to be a good number of tasks or incidents that routinely occur that need a clear owner. We've also tried to set expectations around our service levels relative to the importance of a service. In example, a new mailing list can be set up as convenient, but a failing mail service needs to be addressed immediately.
The technical side of this has been that we have historically alerted via email and/or SMS about any urgent issues that came up. Of course those alerts went to everyone on the team. If the alert occurs at an inconvenient time, either everyone responds, which is likely wasteful, or no one responds thinking someone else will.
At the Infrastructure team's face to face meeting in July we decided we'd adopt an on-call rotation for the contractors so that everyone wasn't responsible for everything all of the time. We then went looking for something to let us sanely (and without building it ourselves) deal with that.
We ended up choosing PagerDuty, which has a number of ways of receiving alerts. More importantly, it allows us to set a schedule, easily override it for holidays or illnesses, and do so programmatically. It also seamlessly integrates with HipChat, which Infrastructure is running a trial of and communicates with our mobile devices.
PagerDuty also supports a clear escalation path that begins alerting other people about issues if the person on-call fails to respond in a timely manner. Additionally, PagerDuty's mobile apps are built with Apache Cordova, which is an interesting circle. We've finished our trial and decided to adopt PagerDuty. PagerDuty was especially gracious and made our account gratis.
Adopting an on-call rotation will allow us to provide a better service and response time, while also clearly setting expectations around contractor availability so they can relax on their off weeks.
If you have questions or want to get involved, feel free to join us on the infrastructure mailing list email@example.com or joining us in our Hipchat room.
Posted at 01:00PM Aug 18, 2014 by ke4qqq in General | |
New status page for the ASF
We are pleased to announce that we have a new status page for our infrastructure and the ASF as a whole.
Where we have previously been focused on reporting the up/down status of our services, we have now begun to look a bit more at the broader picture of the ASF; What's going on, who is committing how much, where are emails going, what's going on on GitHub mirrors and so on, as well as tracking uptime and availability for our public services that power the ASF's online presence.
The result of this broader scope can be seen on: http://status.apache.org
It is our hope that you'll find this new status page informative and helpful, both in times of trouble and times where everything is in working condition.
Posted at 01:45PM Aug 14, 2014 by humbedooh in Status | |
Email from apache.org committer accounts bypasses moderation!
Good news! We've finally laid the necessary groundwork to extend the bypassing of committer emails sent from their apache.org addresses, from commit lists to now all Apache mailing lists. This feature was activated earlier today and represents a significant benefit for cross-collaboration between Apache mailing lists for committers, relieving moderators of needless burden.
Also we'd like to remind you of the SSL-enabled SMTP submission service we offer committers listening on people.apache.org port 465. Gmail users in particular can enjoy a convenient way of sending email, to any recipient even outside apache.org, using their apache.org committer address. For more on that please see our website's documentation.
To complement these features we'd also like to remind committers of the ability to request an "owner file" be added to their email forwarder by filing an appropriate INFRA jira ticket. Owner files alleviate most of the problems associated with outside organizations, who may be running strict SPF policies, attempting to reach you at your apache.org address. Without an owner file those messages will typically bounce back to those organizations instead of successfully reaching you at your target forwarding address. For those familiar with SRS, this is a poor-man's version of that specification's feature set. Please direct your detailed questions about owner files to the firstname.lastname@example.org mailing list.
NOTE: we've extended this bypass feature to include any committer email addresses listed in their personal LDAP record with Apache.
Posted at 02:29AM Jun 15, 2014 by joes in General | |
DMARC filtering on lists that munge messages
Since Yahoo! switched their DMARC policy in mid-April, we've seen an increase in undeliverable messages sent from our mail server for Yahoo! accounts subscribed to our lists. Roughly half of Apache's mailing lists do some form of message munging, whether it be Subject header prefixes, appended message trailers, or removed mime components. Such actions are incompatible with Y!'s policy for its users, which has meant more bounces and more frustration trying to maintain inclusive discussions with Y! users.
Since Y!'s actions are likely just the beginning of a trend towards strong DMARC policies aimed at eliminating forged emails, we've taken the extraordinary step of munging Y! user's From headers to append a spec-compliant .INVALID marker on their address, and dropping the DKIM-Signature: header for such messages. We are an ezmlm shop and maintain a heavily customized .ezmlmrc file, so carrying this action out was relatively straightforward with a 30-line perl header filter prepended to certain lines in the "editor" block of our .ezmlmrc file. The filter does a dynamic lookup of DMARC "p=reject" policies to inform its actions, so we are prepared for future adopters beyond the early ones like Yahoo!, AOL, Facebook, LinkedIn, and Twitter. Interested parties in our solution may visit this page for details and the Apache-licensed code.
Of course this filter only applies to half our lists- the remainder that do no munging are perfectly compatible with DMARC rejection policies without modification of our list software or configuration. Apache projects that prefer to avoid munging may file a Jira ticket with infrastructure to ask that their lists be set to "ezmlm-make -+ -TXF" options.
Posted at 09:57PM Jun 03, 2014 by joes in General | |
Mail outage post-mortem
During the afternoon of May 6th we began experiencing delays in mail delivery of 1-2 hours. Initial efforts at remediation seemed to clear this up but on the morning of May 7th the problem worsened and we proactively disabled mail service to deal with the failure. This outage affected all ASF mailing lists and mail forwarding. The service remained unavailable until May 10th, and it took almost 5 additional days to fully flush the backlog of messages.
You can find a timeline here that was kept during the incident: https://blogs.apache.org/infra/entry/mail_outage
This was a catastrophic failure for the Apache Software Foundation as email is core to virtually every operation and is our primary communication medium.
The mail service at the ASF is composed of three physical servers. Two of these are external facing mail exchangers that receive mail. The final server handles mailing list expansion, alias forwarding and mail delivery in general. That latter server had two volumes that experienced a disk outage each. This degraded performance substantially and led to the mail delays seen on May 6th and 7th. The service was proactively disabled on May 7th in an attempt to let the arrays rebuild without the significant disk I/O overhead caused by processing the large mail backlog. Ultimately multiple attempts to rebuild the underlying arrays failed and eventually other drives in the array where the data volume was stored failed rendering recovery a hopeless task on May 8th. We began working to restore backups from our offsite backup location to our primary US datacenter. When this began to take longer than expected, additional concurrent efforts began to restore service in one of our secondary datacenters as well as in a public cloud instance. Ultimately we ended up completing the restoration to our primary US datacenter first and were able to bring the service online. When the service resumed, we had an estimated 10 million message backlog in addition to our normal 1.7-2 million ongoing daily message flow. The amount of backlogged mail taxed the existing infrastructure and architecture of the mail service and took almost 5 days to completely clear.
Our backups were sufficient to allow us to restore the service in good working order.
Early precautions taken when we discovered the problem combined with our backups resulted in no data loss from the incident.
Our mail exchangers continued to work during the outage and held incoming mail until the service was restored.
What didn't work:
Our monitoring was not sufficient to identify the problem or alert us to the symptoms.
No spare hard drives for this class of machine were on-hand in our primary datacenter.
The restore time from our remote backups took an excessively long time. This was partially due to the large size of the restore data, and partially due to the transport method used for the data.
After the service was restored we had approximately a 10M message backlog that took days to clear.
The primary administrator of the service was on vacation, and the remaining infrastructure contractors were not intimately familiar with the service.
Our documentation was insufficient to easily restore the service in a rapid manner by folks without intimate knowledge.
Our immediate action items:
- Update the documentation to be current/diagram mail flow.
- Improve the monitoring of the mail service itself as well as the hardware.
- Insure we have adequate spares on hand for the majority of our core services.
- Place our mail server under configuration management to reduce our MTTR
Medium-to-Long term initiatives.
- Crosstraining contractors in all critical services
- Work on moving to a more fault-tolerant/redundant architecture
- More fully deploy our config management and automated provisioning across our infrastructure so MTTR is reduced.
Posted at 05:16AM May 28, 2014 by ke4qqq in General | |
New monitoring system: nagios is dead long live circonus
23 may 2014 the old monitoring system "nagios" was put to sleep, and "circonus" was given production status.
The new monitoring system is sponsored by circonus and most of the monitoring as well as the central database runs on www.circonus.com. The infrastructure team have built and deployed logic around the standard circonus system:
- A private broker, to monitor internal services without exposing them on internet
- A dedicated broker (inhouse development) that monitor special ASF systems (like svn compare US - EU)
- A configuration system, that are based on svn.
- A new status page status.apache.org
- A new team structure (all committers with sudo karma on a vm, get an email when something happens with the vm)
The new system is a lot faster and we can therefore offer projects monitoring of project URLs, of course the project also need to have a team that handles the alerts.
The current version has approx. the same facilities as Nagios, but we are planning (and actively programming) a version.2 that will allow us to better predict problems before they occur.
Some of the upcoming features are:
- disk monitoring
- vital data statistic from core system (like size of mail queues)
The change of monitoring system is a vital component in our transition to automate services and thereby enable infra to more effectively secure the stability of the infrastructure as well as make early detection of potential problems.
The system was presented in Apachecon denver 2014, slides can be found here. We hope to present the live version at apachecon budapest 2014.
On behalf of the infrastructure team
Posted at 10:29PM May 23, 2014 by jani in Status | |