Apache Infrastructure Team
Nexus reduced performance issues resolved.
So Tuesday morning we got a report in IRC that a committer was trying to get a release out
and could not deploy. Shortly after a Nexus issue was reported in Jira INFRA-8321. A few
hours later another issue INFRA-8322 related to Nexus was opened. So far, nothing unusual
Yesterday, more issues reported on IRC/HipChat, and more issues opened.
(INFRA-8326,INFRA-8327,INFRA-8328, INFRA-8334). By then it was obvious this more than
a coincidence and it was already being looked into.
Twitter notifications and emails were sent out declaring the degraded performance an outage
and On Call was full time looking into the issue. Others joined the call to assist and eventually
the outage was determined to be a change to LDAP configuration made 2 days ago by Infra.
(See infra:r921805 for the revert of that.)
The LDAP change was made to improve response times as it was being reported as being slow
to return queries. Reverting the change cured the issues Nexus was having contacting the
groups that committers belonged to.
There will be another avenue looked into for improving LDAP query response times whilst not
affecting those services that connect via anon bind.
Infra thanks everyone for their patience whilst this was looked into and resolved.
Thanks go to those involved in working towards the solution:-
Gavin McDonald (gmcdonald)
Tony Stevenson (pctony)
Chris Lambertus (cml)
Daniel Gruno (humbedooh)
Brian Fox (brianf)
Posted at 09:19AM Sep 11, 2014 by administrator in Status | |
On-demand slaves from Rackspace added to builds.apache.org
Posted at 01:00PM Sep 04, 2014 by abayer in General | |
Infrastructure Team Adopting an On-Call Rotation
As the Apache Software Foundation (ASF) has grown, the infrastructure required to support its diverse set of projects has grown as well. To care for the infrastructure that the ASF depends on, the foundation has hired several contractors to supplement the dedicated cadre of volunteers who help maintain the ASFs hardware and services. To best utilize the time of our paid contractors and volunteers, the Infrastructure team will be adopting an on-call rotation to meet requests and resolve outages in a timely fashion.
Why We're Establishing an On-Call Rotation
In groups, especially groups that are charged with overlapping duties, there's occasionally a sense of diffusion of responsibility. There tends to be a good number of tasks or incidents that routinely occur that need a clear owner. We've also tried to set expectations around our service levels relative to the importance of a service. In example, a new mailing list can be set up as convenient, but a failing mail service needs to be addressed immediately.
The technical side of this has been that we have historically alerted via email and/or SMS about any urgent issues that came up. Of course those alerts went to everyone on the team. If the alert occurs at an inconvenient time, either everyone responds, which is likely wasteful, or no one responds thinking someone else will.
At the Infrastructure team's face to face meeting in July we decided we'd adopt an on-call rotation for the contractors so that everyone wasn't responsible for everything all of the time. We then went looking for something to let us sanely (and without building it ourselves) deal with that.
We ended up choosing PagerDuty, which has a number of ways of receiving alerts. More importantly, it allows us to set a schedule, easily override it for holidays or illnesses, and do so programmatically. It also seamlessly integrates with HipChat, which Infrastructure is running a trial of and communicates with our mobile devices.
PagerDuty also supports a clear escalation path that begins alerting other people about issues if the person on-call fails to respond in a timely manner. Additionally, PagerDuty's mobile apps are built with Apache Cordova, which is an interesting circle. We've finished our trial and decided to adopt PagerDuty. PagerDuty was especially gracious and made our account gratis.
Adopting an on-call rotation will allow us to provide a better service and response time, while also clearly setting expectations around contractor availability so they can relax on their off weeks.
If you have questions or want to get involved, feel free to join us on the infrastructure mailing list email@example.com or joining us in our Hipchat room.
Posted at 01:00PM Aug 18, 2014 by ke4qqq in General | |
New status page for the ASF
We are pleased to announce that we have a new status page for our infrastructure and the ASF as a whole.
Where we have previously been focused on reporting the up/down status of our services, we have now begun to look a bit more at the broader picture of the ASF; What's going on, who is committing how much, where are emails going, what's going on on GitHub mirrors and so on, as well as tracking uptime and availability for our public services that power the ASF's online presence.
The result of this broader scope can be seen on: http://status.apache.org
It is our hope that you'll find this new status page informative and helpful, both in times of trouble and times where everything is in working condition.
Posted at 01:45PM Aug 14, 2014 by humbedooh in Status | |
Email from apache.org committer accounts bypasses moderation!
Good news! We've finally laid the necessary groundwork to extend the bypassing of committer emails sent from their apache.org addresses, from commit lists to now all Apache mailing lists. This feature was activated earlier today and represents a significant benefit for cross-collaboration between Apache mailing lists for committers, relieving moderators of needless burden.
Also we'd like to remind you of the SSL-enabled SMTP submission service we offer committers listening on people.apache.org port 465. Gmail users in particular can enjoy a convenient way of sending email, to any recipient even outside apache.org, using their apache.org committer address. For more on that please see our website's documentation.
To complement these features we'd also like to remind committers of the ability to request an "owner file" be added to their email forwarder by filing an appropriate INFRA jira ticket. Owner files alleviate most of the problems associated with outside organizations, who may be running strict SPF policies, attempting to reach you at your apache.org address. Without an owner file those messages will typically bounce back to those organizations instead of successfully reaching you at your target forwarding address. For those familiar with SRS, this is a poor-man's version of that specification's feature set. Please direct your detailed questions about owner files to the firstname.lastname@example.org mailing list.
NOTE: we've extended this bypass feature to include any committer email addresses listed in their personal LDAP record with Apache.
Posted at 02:29AM Jun 15, 2014 by joes in General | |
DMARC filtering on lists that munge messages
Since Yahoo! switched their DMARC policy in mid-April, we've seen an increase in undeliverable messages sent from our mail server for Yahoo! accounts subscribed to our lists. Roughly half of Apache's mailing lists do some form of message munging, whether it be Subject header prefixes, appended message trailers, or removed mime components. Such actions are incompatible with Y!'s policy for its users, which has meant more bounces and more frustration trying to maintain inclusive discussions with Y! users.
Since Y!'s actions are likely just the beginning of a trend towards strong DMARC policies aimed at eliminating forged emails, we've taken the extraordinary step of munging Y! user's From headers to append a spec-compliant .INVALID marker on their address, and dropping the DKIM-Signature: header for such messages. We are an ezmlm shop and maintain a heavily customized .ezmlmrc file, so carrying this action out was relatively straightforward with a 30-line perl header filter prepended to certain lines in the "editor" block of our .ezmlmrc file. The filter does a dynamic lookup of DMARC "p=reject" policies to inform its actions, so we are prepared for future adopters beyond the early ones like Yahoo!, AOL, Facebook, LinkedIn, and Twitter. Interested parties in our solution may visit this page for details and the Apache-licensed code.
Of course this filter only applies to half our lists- the remainder that do no munging are perfectly compatible with DMARC rejection policies without modification of our list software or configuration. Apache projects that prefer to avoid munging may file a Jira ticket with infrastructure to ask that their lists be set to "ezmlm-make -+ -TXF" options.
Posted at 09:57PM Jun 03, 2014 by joes in General | |
Mail outage post-mortem
During the afternoon of May 6th we began experiencing delays in mail delivery of 1-2 hours. Initial efforts at remediation seemed to clear this up but on the morning of May 7th the problem worsened and we proactively disabled mail service to deal with the failure. This outage affected all ASF mailing lists and mail forwarding. The service remained unavailable until May 10th, and it took almost 5 additional days to fully flush the backlog of messages.
You can find a timeline here that was kept during the incident: https://blogs.apache.org/infra/entry/mail_outage
This was a catastrophic failure for the Apache Software Foundation as email is core to virtually every operation and is our primary communication medium.
The mail service at the ASF is composed of three physical servers. Two of these are external facing mail exchangers that receive mail. The final server handles mailing list expansion, alias forwarding and mail delivery in general. That latter server had two volumes that experienced a disk outage each. This degraded performance substantially and led to the mail delays seen on May 6th and 7th. The service was proactively disabled on May 7th in an attempt to let the arrays rebuild without the significant disk I/O overhead caused by processing the large mail backlog. Ultimately multiple attempts to rebuild the underlying arrays failed and eventually other drives in the array where the data volume was stored failed rendering recovery a hopeless task on May 8th. We began working to restore backups from our offsite backup location to our primary US datacenter. When this began to take longer than expected, additional concurrent efforts began to restore service in one of our secondary datacenters as well as in a public cloud instance. Ultimately we ended up completing the restoration to our primary US datacenter first and were able to bring the service online. When the service resumed, we had an estimated 10 million message backlog in addition to our normal 1.7-2 million ongoing daily message flow. The amount of backlogged mail taxed the existing infrastructure and architecture of the mail service and took almost 5 days to completely clear.
Our backups were sufficient to allow us to restore the service in good working order.
Early precautions taken when we discovered the problem combined with our backups resulted in no data loss from the incident.
Our mail exchangers continued to work during the outage and held incoming mail until the service was restored.
What didn't work:
Our monitoring was not sufficient to identify the problem or alert us to the symptoms.
No spare hard drives for this class of machine were on-hand in our primary datacenter.
The restore time from our remote backups took an excessively long time. This was partially due to the large size of the restore data, and partially due to the transport method used for the data.
After the service was restored we had approximately a 10M message backlog that took days to clear.
The primary administrator of the service was on vacation, and the remaining infrastructure contractors were not intimately familiar with the service.
Our documentation was insufficient to easily restore the service in a rapid manner by folks without intimate knowledge.
Our immediate action items:
- Update the documentation to be current/diagram mail flow.
- Improve the monitoring of the mail service itself as well as the hardware.
- Insure we have adequate spares on hand for the majority of our core services.
- Place our mail server under configuration management to reduce our MTTR
Medium-to-Long term initiatives.
- Crosstraining contractors in all critical services
- Work on moving to a more fault-tolerant/redundant architecture
- More fully deploy our config management and automated provisioning across our infrastructure so MTTR is reduced.
Posted at 05:16AM May 28, 2014 by ke4qqq in General | |
New monitoring system: nagios is dead long live circonus
23 may 2014 the old monitoring system "nagios" was put to sleep, and "circonus" was given production status.
The new monitoring system is sponsored by circonus and most of the monitoring as well as the central database runs on www.circonus.com. The infrastructure team have built and deployed logic around the standard circonus system:
- A private broker, to monitor internal services without exposing them on internet
- A dedicated broker (inhouse development) that monitor special ASF systems (like svn compare US - EU)
- A configuration system, that are based on svn.
- A new status page status.apache.org
- A new team structure (all committers with sudo karma on a vm, get an email when something happens with the vm)
The new system is a lot faster and we can therefore offer projects monitoring of project URLs, of course the project also need to have a team that handles the alerts.
The current version has approx. the same facilities as Nagios, but we are planning (and actively programming) a version.2 that will allow us to better predict problems before they occur.
Some of the upcoming features are:
- disk monitoring
- vital data statistic from core system (like size of mail queues)
The change of monitoring system is a vital component in our transition to automate services and thereby enable infra to more effectively secure the stability of the infrastructure as well as make early detection of potential problems.
The system was presented in Apachecon denver 2014, slides can be found here. We hope to present the live version at apachecon budapest 2014.
On behalf of the infrastructure team
Posted at 10:29PM May 23, 2014 by jani in Status | |
During the afternoon of May 6th we began experiencing delays in mail delivery of 1-2 hours. Initial efforts at remediation seemed to clear this up but on the morning of May 7th the problem worsened and we proactively disabled mail service to deal with the failure. The underlying hardware suffered failures on multiple disks. This outage effects all ASF mailing lists and mail forwarding.
This service is housed at OSUOSL, and we are currently waiting on smart hands to help with replacing hardware. Our expectation at this point is that we still have multiple hours worth of outage.
Incoming mail is currently being received and held in queue by our mail exchangers. We also have a copy of the existing queue that hasn't been processed; so we expect no mail or data loss.
ASF Infra's twitter bot will provide updates as we have them for the duration of the outage. Feel free to follow @infrabot on Twitter. There will be an update on this post as well as the situation progresses.
UPDATE 7 May 19:27 UTC - Drives have been replaced, array is attempting to rebuild. As indicated earlier on twitter, there likely remains hours of outage.
UPDATE 7 May 20:44 UTC - The disk array is still in the process of repairing. Several hundred mails were processed during a reboot, but more work remains before service is restored. Mail service has been disabled again as the array repair process is CPU-bound. The plan going forward is to allow the disk arrays to finish repairs. Once that is complete, we'll reenable the mail service and flush what is currently in the queue. Finally, once the queue is empty we'll begin receiving mail again.
UPDATE 8 May 05:00 UTC - The disk array failed to repair itself. The disks have been replaced and a new installation has been completed. Progress continues to be made towards resolution, but nothing firm enough yet for us to predict an time for restoration.
UPDATE 8 May 15:45 UTC - No material change of status has occurred. Infra worked in shifts around the clock last night and continue to do so to restore service. More updates as they become available.
UPDATE 9 May 11:20 UTC - We are working on temporarily restoring the most essential email aliases. In the meantime, inquiries may be made to email@example.com or on our IRC channel, #asfinfra on Freenode. The work on restoring the service in full is still ongoing.
UPDATE 9 May 17:20 UTC - We've successfully restored a host from backups and will be starting testing soon. Based on the progress made in those tests we'll try and provide expectations around restoration of service timeline.
UPDATE 10 May 15:45 UTC - We've started pushing live mails through the system - you'll begin to see them trickle in as we gradually open the floodgates to restore service. Expect intermittent spurts for a while.
UPDATE 10 May 21:55 UTC - The floodgates have been opened. As we have a significant amount of backlog to catch up on, please be patient as the service does this. As always feel free to contact us if you have any questions. In the immediate short term (next day or so, we suggest you continue to use firstname.lastname@example.org and our IRC channel, #asfinfra on Freenode. We would like to thank you for your patience during this extremely busy time.
UPDATE 12 May 16:04 UTC - Clarification - we have opened the floodgates, but have a substantial amount of backlog; and with the sudden rush of mail are being throttled by various mail services. With the addition of mail thats coming through anyway; it may take us from 2-5 days to fully flush the backlog. This time is so wide because of a wide variety of factors that are largely outside of our control, such as new mail coming in and mail services individual throttling policies.
heartbleed fallout for apache
What we've learned about the heartbleed incident is that it is hard, in the sense of perhaps only viable to a well-funded blackhat operation, to steal a private certificate and key from a vulnerable service. Nevertheless, the central role Apache projects play in the modern software development world require us to mitigate against that circumstance. Given the length of time and exposure window for this bug's existence, we have to assume that some/many Apache passwords may have been compromised, and perhaps even our private wildcard cert and key, so we've taken a few steps as of today:
- We fixed the vulnerability in our openssl installations to prevent further damage,
- We've acquired a new wildcard cert for apache.org that we have rolled out prior to this blog entry,
- We will require that all committers rotate their LDAP passwords (committers visit id.apache.org to reset LDAP passwords once they've been forcibly reset),
- We are encouraging all service administrators to all non-LDAP service like jira to rotate those passwords as well.
Regarding the cert change for svn users- we'd also like to suggest that you remove your existing apache.org certs from your .subversion cache to prevent potential MITM attacks using the old cert. Fortunately it is relatively painless to do this:
% grep -l apache.org ~/.subversion/auth/svn.ssl.server/* | xargs rm
NOTE: our openoffice wildcard cert was never vulnerable to this issue as it was served from an openssl-1.0.0 host.
Scaling down the CMS to modest but intricate websites
The original focus of the CMS was to provide the tools necessary for handling http://www.apache.org/ and similar Anakia-based sites. The scope quickly changed when Apache OpenOffice was accepted into the incubator... handling over 9GB of content well was quite an undertaking and will be soon discussed at Apachecon in Denver during Dave Fisher's talk. From there the build system was extended to allow builds using multiple technologies and programming languages.
Since that time in late 2012 the CMS codebase has sat still, but recently we've upped the ante and decided to offer features aimed at parity with other site building technologies like jekyll, nanoc and middleman. You can see some of the new additions to the Apache CMS in action at http://thrift.apache.org/. The Apache Thrift website was originally written to use nanoc before being ported to the newly improved Apache CMS. They kept the YAML headers for their markdown pages and converted from a custom preprocessing script used for inserting code snippets to using a fully-supported snippet-fetching feature in the Apache CMS.
"The new improvements to the Apache CMS allowed us to quickly standardize the build process and guarantee repeatable results as well as integrate direct code snippets into the website from our source repository."
- Jake Farrell, Apache Thrift PMC Chair
Posted at 06:23PM Mar 25, 2014 by joes in General | |
Improved integration between Apache and GitHub
After a few weeks of hard work and mind-boggling debugging, we are pleased to announce tighter and smarter integration between GitHub and the Apache Software Foundation's infrastructure.
These new features mean a much higher level of replication and retention of what goes on on GitHub, which in turns both help projects maintain control over what goes on within their project, as well as keeping a record of everything that's happening in the development of a project, whether it be on ASF hardware or off-site on GitHub.
To be more precise, these new features allows for the following:
- Any Pull Request that gets opened, closed, reopened or commented on now gets recorded on the project's mailing list
- If a project has a JIRA instance, any PRs or comments on PRs that include a JIRA ticket ID will trigger an update on that specific ticket
- Replying to a GitHub comment on the dev@ mailing list will trigger a comment being placed on GitHub (yes, it works both ways!)
- GitHub activity can now be relayed to IRC channels on the Freenode network.
As with most of our things, this is an opt-in feature. If you are in a project that would like to take advantage of these new features, please contact infrastructure, preferably by filing a JIRA ticket with the component set to Git, and specifying which of the new features you would like to see enabled for your project.
On behalf of the Infrastructure Team, I hope you will find these new features useful and be mindful in your use of them.
Posted at 01:16AM Feb 12, 2014 by humbedooh in Development | |
paste.apache.org sees the light of day
Today, the Apache Infrastructure team launched http://paste.apache.org, a new ASF-driven site for posting snippets, scripts, logging output, configurations and much more and sharing them with the world.
Why yet another paste bin, you ask?
Well, for starters, this site is different in that is it run by the ASF, for the ASF, in that we fully control what happens to your data when you post it, or perhaps more important, what does NOT happen to it. The site enforces a "from committers to everyone" policy, meaning only committers may post new data on the site, but everyone is invited to watch the result. While this is not a blanket guarantee that the data is accurate or true, it is nonetheless a guarantee that what you see is data posted by an Apache committer.
Secondly, committers have the option to post something as being "committers only", meaning only committers within the ASF can see the paste. This is much like the "private" pastes offered by many other sites, but with the added benefit that it prevents anyone snooping around from watching whatever you paste, unless they are actually a committer.
Great, so how does it work?
It works like most other paste sites, in that you go to http://paste.apache.org, paste your data, select which type of highlighting to use, and you get an URL with your paste. For text-only clients, raw data will be displayed, while regular browsers will enjoy a full web page with the ability to download or edit a paste. Currently we have support for httpd configurations, C/C++, Java, Lua, Erlang, XML/HTML, PHP, Shell scripts, Diff/Patch, Python and Perl syntax highlighting. If you want to have any other type of highlighting added, don't hesitate to ask!
Since this site enforces the "from committers to everyone" policy, you are required to use your LDAP credentials when making a paste. To allow for the use of the service within console applications (shells etc) that might not (or should not) provide authentication credentials (on public machines you'd want to avoid storing your committer credentials for instance!), we have equipped the site with a token generator, that both allows you to pipe any output you may have directly to the site as well as gives you some hints on how you may achieve this.
Imagine you have a directory listing that you'd only want your fellow committers to see. Publishing this, using the token system, is as easy as doing:
$> ls -la | privpaste
And there you have it, the command returns a URL ready for sharing with your fellow committers. Had you wanted for eveyone to be able to see it, you could have used the pubpaste alias instead (click on "generate token" on the site to get more information about tokens and the useful aliases).
We hope you'll enjoy this new service, and use it wisely as well as often. Should you have any questions or suggestions, we'd be most happy to receive them through any infra channel you want to use.
Posted at 06:37PM Mar 06, 2013 by humbedooh in General | |
New Infra Team Members
Since out last update over a year ago, the Infra Team has expanded by another NINE (9) members!
Congrats and our warmest thanks go to:
Niklas Gustavsson - (ngn)
Jeremy Thomerson - (jrthomerson)
Mark Struberg - (struberg)
Eric Evans - (eevans)
Brandon Williams - (brandonwilliams)
Mohammad Nour El-Din - (mnour)
David Nalley - (ke4qqq)
Yang Shih-Ching - (imacat)
Daniel Gruno - (humbedooh)
The rest of the Infra team look forward to continuing to work with you all.
There are now a total of 80 infrastructure members with another 36 in the infrastructure-interest group.
ASF Comments System Live!
Daniel Gruno has recently developed a comments system for Apache projects to use. The purpose of the system is to enable public commentary on project webpages and is already in production use in the httpd and trafficserver projects. This new system nicely complements the ASF CMS system and trivially integrates with it- see http://comments.apache.org/help.html for details.
The comment system is now open- enjoy! Please file a jira ticket with INFRA to get started today.
Posted at 04:49PM Jul 09, 2012 by joes in General | |