Apache Infrastructure Team

Monday Jun 29, 2015

Buildbot master currently off-line

Update (2015-06-30 ~12.00 UTC):

The replacement buildbot master is now live. The CMS service and the ci.apache.org website have been restored. Project CI builds are mostly working, but builds that upload docs, snapshots, etc. to the buildmaster for publishing are likely to fail at the upload stage while we ensure all the necessary directory structures are in place to receive the uploads. Work to resolve these final few issues is ongoing.

We continue to try to contact the owner of the account where the IRC proxy was running. Their account remains locked in case it has been compromised. In addition, all of the account's commits have been reviewed by other project committers, and that review has confirmed that no malicious commits were made by the account in question.

The review of aegis.apache.org is ongoing. So far, no evidence of compromise has been found beyond the possible compromise of the single, non-privileged user account.

Original post (2015-06-29 ~21.00 UTC):

As per the e-mails to committers@ earlier today, aegis.apache.org is currently offline after a report was received that suspicious network traffic had been observed from that host. This blog post will be updated as more information becomes known.

What we know:

  • At ~16.00 UTC 28 June 2015, suspicious network activity from a buildbot host was reported to the Apache security team.
  • Further information was requested and at ~18.00 UTC 28 June 2015 the Apache Infrastructure team received a copy of network logs that showed a number of suspicious IRC connections originating from aegis.apache.org.
  • These IRC connections were traced to a non-privileged user account on aegis.apache.org running an open IRC proxy.
  • At ~20.00 UTC 28 June 2015 the user account concerned was locked for all ASF services and the proxy process terminated.
  • At ~10.00 UTC 29 June 2015, after further discussion within the infrastructure team, aegis.apache.org was taken off-line as a precaution.

It remains unclear whether the open IRC proxy was installed by the user that owned the account or whether their account was compromised and the IRC proxy was installed by an unauthorized user.

It is worth stressing that no further information came to light between 20.00 UTC 28 June 2015 and 10.00 UTC 29 June 2015 that triggered the decision to take the host off-line. The host was taken off-line purely as a precaution while we reviewed the available information. That process is ongoing. So far we have found no evidence to suggest anything more than a user account being used to run an IRC proxy, and plenty of evidence suggesting that this was the only activity the account was used for.

Risks:

There is no risk to released source or binaries for any ASF project. There are multiple reasons for this:

  • buildbot is a CI system used to build snapshots, not releases
  • no builds are performed on aegis.apache.org

Buildbot is used to build some project web sites and/or project documentation. The risk of compromise here is viewed as very low for the following reasons:

  • the builds do not take place on aegis.apache.org
  • diffs of every change are sent to the relevant project team's mailing list for review, and an unexpected/malicious change would be spotted

Project impact:

The following services are currently off-line and will remain so until the buildbot master is restored:

  • All buildbot builds
  • Projects that use the CMS will be unable to update their web sites (the CMS uses buildbot to build web site updates)
  • The ci.apache.org website

Work in progress:

Analyzing aegis.apache.org is going to take time and, while we view the chances of a wider compromise of this host as very, very small, we are not willing to bring the host back on line at this point. This host was due for replacement, so the decision has been taken to pull this work forward and rebuild the buildbot master on a new host now. We have taken this decision not because we believe aegis.apache.org to be compromised, but because it is possible to complete this work far more quickly than it is to confirm our view that aegis.apache.org is not compromised. We currently estimate that the rebuild of the new buildbot master host will be completed by 1 July 2015.

We continue to analyze the information we have obtained from aegis.apache.org and from other sources and will update this blog post as more information becomes available.

Questions:

Questions, concerns, comments etc. should be directed to infrastructure@apache.org

Wednesday Jun 10, 2015

Confluence Wiki service to be restarted

Hi All,

There will be a planned reboot of Confluence on Friday 12th June at 18:00 UTC+1

This is a blog post notice as recommended in our Core Services planned downtime SLA.

The Confluence wiki service configuration is stored in our Puppet configuration.

We have made some modifications to the Puppet manifest affecting the module that
Confluence uses (cwiki_asf). Some code is being moved out of the module and into a
host-specific YAML file. This will make it easier for future hosts to re-use the
module (such as an upgrade host currently awaiting these changes).
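
For illustration, host-specific data of that sort might look something like the sketch
below - the keys shown are hypothetical, not the actual cwiki_asf parameters:

    # hieradata/cwiki-host.apache.org.yaml -- a minimal sketch with invented keys
    cwiki_asf::confluence_version: '5.8.5'
    cwiki_asf::jvm_heap_mb: 8192
    cwiki_asf::http_port: 8080
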
A Twitter notification will be posted 1 hour before.
A planned maintenance notice will be posted on status.apache.org.

If necessary, we will make use of this outage window to apply any OS updates and reboot
the host VM.

Actual downtime should be no more than 1 hour all being well.

An email about this will be sent to infrastructure@ after the service has resumed from the planned downtime.

Monday May 18, 2015

Planned downtime for Jira

Hi All,

There will be a planned reboot of Jira on Thursday 21st May at 16:00 UTC+1

This is 72 hours notice as recommended in our Core Services planned downtime SLA.

Currently, Jira requires a reboot when adding new projects to it. There is an outstanding
ticket with Atlassian about this. They require logs and so these will be gathered at the
time of the planned reboot.

Projects being added to Jira at this time will include:-

INFRA-9516 - Myriad
INFRA-9609 - Atlas
INFRA-9635 - CMDA

and any more that get requested between now and downtime.

Any projects requiring issues to be imported from other issue trackers will NOT be done at
this time.

A notification via @infrabot will be tweeted 24 hrs and 1 hr before.
A planned maintenance notice will be posted on status.apache.org.

Actual downtime should be no more than 10 minutes all being well.

The next email about this will be after the service has resumed from the planned downtime.

Thanks!

Gav…

Friday May 08, 2015

Mail Service Architecture Changes

For the past few months the Infrastructure team have been working extremely hard to re-design, implement and manage changes to the email service architecture.  Today we are proud to announce that phase 1 of this has been completed, and has been running for several days now.

Phase 1 covers all components of the service except the listserv service and the mail archives. These will be included in phase 2, which we will come onto later. When we started out on this project to review, update and manage our email infrastructure, we had several guiding principles that either the old system had to be made to conform to, or any new service would need to meet before being accepted. When we talk about these principles we are really talking about criteria, which are:

  • The service must be entirely managed (operationally) from our puppet service. 
  • All software must be packaged - i.e. .debs, either from upstream or packaged locally and held in our own repo. Deploying from source is no longer acceptable.
  • All the work carried out by Puppet et al. must be idempotent.
  • We will not allow the service design to restrict our ability to either adapt it, or grow it at will and on demand. 

Very early on in the design and testing work it became clear that we needed clear separation of each of the roles in the email service infrastructure. This would allow us, in the future, to add more capability of any given type if for some reason it were needed. Let's say, for example, we needed more SpamAssassin capacity: that role can be scaled sideways to swallow the load, without each new host also needing to be an MX host or listserv host, etc.

The design we have settled upon, with phase 1 complete, can be seen in this diagram: http://www.apache.org/dev/mailflow.jpg - it shows that we have deployed several MX hosts (each of which is more than capable of handling our entire inbound mail load comfortably) in differing AWS regions globally. While we don't need three MX hosts to cope with capacity, we wanted three to cope with networking resilience, should any of these instances suffer network degradation or outage.

These MX hosts are simple Postfix instances that run Postfix postscreen, RBL checks, and amavisd-new. Performing only RBL checks at the edge frees the internal scanning hosts from having to scan emails needlessly; amavis is simply used to pass the emails internally for scanning.
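
As a rough sketch, edge-only RBL checking with postscreen looks something like the
following (the lists and thresholds here are illustrative, not our actual configuration):

    # main.cf on an MX host -- a minimal sketch with illustrative values
    postscreen_dnsbl_sites = zen.spamhaus.org*2, bl.barracudacentral.org*1
    postscreen_dnsbl_threshold = 2
    postscreen_dnsbl_action = enforce
    postscreen_greet_action = enforce
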

Once the mails have been passed on by the MX (and there is an interesting detail about how exactly the mails are handled by Amavis that might be a blog post in the near future) they are handled by our scanning cluster. This group of hosts utilise SpamAssassin, ClamAV and again Postfix. While these may not be new technologies, again having a dedicated host or hosts in our case allows us to tune the services specifically for the resources dedicated to scanning and not worry about choking other local services. Of course it also means that should we see a marked increase in mail volume we can easily deploy a new node in a matter of minutes and have it join the rotation and start scanning email.

All of the scanning nodes are fronted by an HAProxy instance. This allows us to load balance our nodes without having to reconfigure the MX hosts should we change the number of scanning hosts. It also means we can take a node out of rotation for maintenance and none of the MX hosts need to be reconfigured or modified in any way.
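
A minimal sketch of that HAProxy arrangement (host names and ports are invented for
illustration, not the production configuration):

    # haproxy.cfg -- illustrative sketch only
    listen smtp-scanners
        bind 0.0.0.0:25
        mode tcp
        balance roundrobin
        server scan1 scan1.example.org:25 check
        server scan2 scan2.example.org:25 check
        # adding capacity is one more 'server' line; taking a node out for
        # maintenance requires no change on the MX hosts
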

As we said earlier, this is only phase 1. You will see in the diagram that we are still running our old ezmlm/qmail stack. This will now become the focus of phase 2: determining what changes, if any, best suit our projects and the foundation as a whole. One of the failings of the current system is that if the listserv host goes down, mail basically stops flowing, as this is the authoritative host for all apache addresses. We will also be looking very hard at how we can run multiple listserv hosts to remove that single point of failure.

The foundation relies on email as its official internal communication mechanism; this is nowhere more evident than when we say "If it didn't happen on the list, it didn't happen". Moving this service forward will be a significant challenge, one which we hope to deliver as soon as we can.

As always, if you have any questions please email infrastructure@apache.org and we will do what we can to help.

On behalf of the Infrastructure Team
--pctony  

Wednesday Apr 29, 2015

Apache Services and SHA-1 SSL Cert deprecation

As some of you may have already encountered, certain services within Apache appear to have broken SSL support. While each cert is still valid, the certs are signed using SHA-1, which both Microsoft and Google have stopped accepting as valid. We are working on fixing this and will use this blog post to track which services will be updated and when (as well as sending emails).

Services:

  • git-wip-us
  • TLP sites
  • SSL terminator (erebus-ssl)
  • svn-master
  • mail-relay

Schedule:

  • git-wip-us: Friday May 1, 16:00 UTC
  • TLP sites: Friday May 1, 16:00 UTC
  • SSL terminator (erebus-ssl): Friday May 1, 16:00 UTC
  • svn-master: Friday May 1, 16:00 UTC
  • mail-relay: Friday May 1, 16:00 UTC

Git based websites available

If you have worked on a web site for an Apache project, you've probably come across the fact that everything has to be in Subversion for web sites. The reason for this has been the desire to have a unified standard for publishing web site contents across all projects. The current workflow is handled by two components: svnpubsub - a pubsub service for subversion - and svnwcsub, the client for svnpubsub. In 2013 we added a similar method for Git, called gitpubsub. Nowadays, gitpubsub is used for a ton of different service messages in the ASF: Git commits, JIRA notifications, GitHub communication and so on. As of today, we have added gitwcsub, a gitpubsub client similar to svnwcsub, enabling projects to use git as their repository for web site content.

In order to use git as your web site repository, you must have your web site in a git repo. This can either be an existing repository or a new one created just for your web site. gitwcsub will, by default, pull content from the asf-site branch of any repo set up for it, so all that needs to be done is to have this branch in a repo on git-wip-us.apache.org and you can have your project's site published via git.
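
For example, creating and publishing that branch in an existing repository might look
something like this (the repo name here is hypothetical):

    # start an empty asf-site branch and push it
    git clone https://git-wip-us.apache.org/repos/asf/foo.git && cd foo
    git checkout --orphan asf-site
    git rm -rf .                  # begin the branch with no inherited content
    # copy in the generated site content, then:
    git add .
    git commit -m "Initial web site content"
    git push origin asf-site
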

To have your site transferred to a git based workflow, please file a JIRA ticket with infrastructure.

Lastly, we want to thank the CouchDB project for being guinea pigs in this process!

Wednesday Apr 15, 2015

Apache gains additional Travis-CI capacity

Travis-CI is a distributed continuous integration platform that integrates well with projects on Github. As many of our projects are taking advantage of our Github integration, they're also making use of Travis-CI for testing of inbound patches.

Travis CI offers a free account for open source projects, with a built-in assumption that there is generally a single project per Github organization. The level of resources and number of jobs able to run is 'fair use', which is fair indeed considering that it is gratis.

Of course, most Github organizations aren't as large as the Apache organization on Github, and we recently discovered that the Foundation was one of the largest gratis open source users of Travis CI. On average, our build queue length was in excess of 300 jobs. While we appreciate the generosity of the Travis-CI folks, our demand for their services was clearly outstripping the available supply. This also meant that a lot of Apache projects were frustrated, or even abandoning their efforts to use Travis-CI, because the length of time for a build to start was high enough to not really qualify as 'continuous'.

To that end, we've now purchased a subscription to Travis services, and have moved from 'fair use' to having 30 concurrent builds. This should be a dramatic increase in throughput for Apache projects who make use of Travis.

Monday Apr 13, 2015

Introducing JIRA Service Desk

As part of our ongoing efforts to streamline our service offerings, and to make it easier to interact with the Infrastructure team we are launching an instance of JIRA Service Desk. 

This should make it much simpler to submit common JIRA issues, such as SVN->GIT migration, New wiki, New JIRA project, etc. The forms ask for the minimum amount of data we would need to complete the request. 

One common theme we found delaying resolution was the need for additional information before a ticket could be actioned. Service Desk allows us to request the exact information needed for a specific task.

We would like to ask everyone to start using this to submit new issues. You can access this new service here: https://helpinfrahelpyou.apache.org or https://infrahelp.apache.org

Friday Feb 27, 2015

Towards a redeployable future, or how I stopped worrying and learned to love setting the execute bit on CGI files

Things change, even within the ASF.

One of these changes is to our infrastructure, and is a move from manually managed and maintained web servers towards re-deployable, configuration-managed servers that tend to themselves and rarely, if ever, require manual intervention. As such, we have started moving towards no longer manually fixing bugs that crop up on various project web sites, in particular incorrect permissions on files. This means that all projects are now required to check their download scripts and verify that the executable flag is set on these CGI files. If not, your download page will likely not work.
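
For sites kept in Subversion, checking and fixing the flag is a matter of the
svn:executable property; for example (the file name is illustrative):

    # inspect, set, and commit the execute bit on a download script
    svn propget svn:executable download.cgi
    svn propset svn:executable ON download.cgi
    svn commit -m "Set execute bit on download.cgi" download.cgi
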

Whenever we receive an email from a user of an Apache project about an error on a project web site, we will forward this to the respective project, but we ask that projects take proactive measures and check their download scripts (and any other scripts they may have) to ensure that they have the right permissions set and work.

Projects using the CMS system will, for the time being, have to commit the execute bit changes directly to the staging repo for their site.

With regards,
Daniel on behalf of the Infrastructure Team.

Monday Jan 12, 2015

Downtime notice for the R/W git repositories

Folks,

Please note that on Thursday 15th at 20:00 UTC the Infrastructure team
will be taking the read/write git repositories offline. We expect
this migration to last about 4 hours.

During the outage the service will be migrated from an old host to a
new one. We intend to keep the URL the same for access to the repos
after the migration, but an alternate name is already in place in case
DNS updates take too long. Please be aware it might take some hours
after the completion of the downtime for GitHub to update and reflect
any changes.

The Infrastructure team have been trialling the new host for about a
week now, and [touch wood] have not had any problems with it.

The service is currently available by accessing repos via:
https://git-wip-us.apache.org

If you have any questions please address them to infrastructure@apache.org

Tuesday Dec 09, 2014

SVN Service Outage - PostMortem

Summary

On Wednesday December 3rd the main US host for the ASF subversion service failed, resulting in loss of service. This loss prevented committers from submitting any changes; whilst we have an EU mirror, it is read-only and does not allow any changes to be submitted whilst the master is offline.

The cause of the outage was a failed disk. This failed disk was part of a mirrored OS pair.  Some time prior to this the alternate disk had also been replaced due to a failed state.

Timeline

0401 UTC 2014-10-26 - eris daily run output notes the degraded state of root disk gmirror
1212 UTC 2014-10-30 - INFRA-8551 created to deal with gmirror degradation.
2243 UTC 2014-12-02 - OSUOSL replaced disk in eris
0208 UTC 2014-12-03 - Subversion begins to crawl to a halt
0756 UTC 2014-12-03 - First contractor discovers something awry with subversion service
0834 UTC 2014-12-03 - Infrastructure sends out a notice about the svn issue
0916 UTC 2014-12-03 - Response to issue begins
1010 UTC 2014-12-03 - First complaints about mail being slow/down
1025 UTC 2014-12-03 - Discovery that email queue alerts had been silenced.
1225 UTC 2014-12-03 - Discovery that the eris outage was affecting LDAP-based services including Jenkins and mail
1613 UTC 2014-12-03 - First attempt at power cycling eris
1717 UTC 2014-12-03 - Concern emerges that the 'good' disk in the mirror isn't.
1744 UTC 2014-12-03 - OSUOSL staff shows up in the office
1752 UTC 2014-12-03 - Blog post went up.
1906 UTC 2014-12-03 - New hermes/baldr (hades) being set up for replacement of eris
1911 UTC 2014-12-03 - #svnoutage clean room in hipchat began
2040 UTC 2014-12-03 - machine finally comes up and is usable.
2050 UTC 2014-12-03 - Confusion arises over which switch is in which rack: an impedance mismatch between what OSUOSL calls racks and what we call racks.
                                      [Dec-3 5:50 PM] Tony Stevenson: whcih rack is this
                                      [Dec-3 5:50 PM] Tony Stevenson: 1, 2 or 3
                                      [Dec-3 5:50 PM] Justin Dugger (pwnguin): 19 
                                      [Dec-3 5:50 PM] David Nalley: what switch?
                                      [Dec-3 5:50 PM] Justin Dugger (pwnguin): HW type: HP      ProCurve 2530-48G                OEM S/N 1: CN2BFPG1F5
                                      [Dec-3 5:50 PM] David Nalley: ^^^^^^^^^ points to this impedance mismatch for the postmortem
                                      [Dec-3 5:50 PM] David Nalley: no label on the switch?
2054 UTC 2014-12-03 - Data copy begins
0441 UTC 2014-12-04 - data migration finished
1457 UTC 2014-12-04 - SVN starts working again - testing begins
0647 UTC 2014-12-05 - svn-master is operational again with viewvc



Problems

  • It took us far too long to spin up a replacement machine. This in fact took a few hours, due to having to manually build the host from source media and encountering several BIOS/RAID-controller issues. Our endeavour to have automated provisioning of tin (bare metal) would certainly have improved this time considerably had it been available at the time of the event.
  • Many machines were pointing to eris.a.o for LDAP - not to a service name (such as ldap1-us-west, for example) - which meant we couldn't easily restore LDAP services for some US hosts without also making them think SVN services had moved.
  • Assigning of issues in JIRA - It has perhaps been a long held understanding that if an issue is assigned to someone in JIRA then they are actively managing that issue. This event clearly shows how fragile that belief is.
  • DNS (geo) updates were problematic - Daniel will be posting a proposal on Thursday, which will outline our concerns around DNS and a viable way forward that meets our needs and is not reliant on us storing all the data in SVN to be able to effect changes to zones. (This proposal was not created as a result of this event; it has been worked on for a number of weeks now.)
  • Architectural problems for availability

To Do

  • Daniel to investigate and evaluate multimaster service availability.
  • Implement an extended SSL check that not only ensures the service is up, but also checks cert validity (expiry, revocation status, etc.) and that the certificate chain is valid (a first-cut sketch follows this list).
  • De-couple DNS from SVN
  • De-couple the SVN authz file from SVN directly. Also, breser@ has suggested we use the authz validation tool available from the svn install we have on hades as part of the template->active file generation process.
  • Move the ASF status page (http://status.apache.org) outside of our main colos so folks can continue to see it in the event of an outage.
  • Vendor provided hardware monitoring tools mandatory on all new hardware deployments.
  • Broader audience for incidents and status reports
  • More aggressive host replacement before these issues arise 
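
For the extended SSL check mentioned above, a first cut might be as simple as the
following (the host is illustrative; a fuller check would also cover revocation
status and chain validation):

    # report certificate expiry dates and subject for a given service
    echo | openssl s_client -connect svn.apache.org:443 -servername svn.apache.org 2>/dev/null \
      | openssl x509 -noout -dates -subject
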


Things being considered

  • Mandatory use of SNMP for enhanced data gathering. 
  • Issue 'nagging' - develop some thoughts and ideas around the concept of auto-transitioning unmodified JIRA issues after N hours of inactivity, and actively nagging the group until an update is made. This, for example, is how Atlassian (and so many others) handle their issues: if an end-user doesn't update the issue within 5 days, it is automatically closed; if we don't update an open critical issue within 6 hours, we get nagged about it.
  • Automatically create new JIRA issues (utilising above mentioned auto-transition) to notify of hardware issues (not just relying on hundreds of cron emails a day).
  • Again, as part of wider thinking about how we use issue tracking, consider the concept that you only assign an issue to yourself if you are explicitly working on it at that moment, i.e. it should not sit in the queue assigned to someone for > N hours without receiving any updates.

Things that went well

  • The people working on the issue worked extremely well as a team, communicating with one another via hipchat and helping each other along where required. There was a real sense of camaraderie for the first time in a very long time, and this sense of team helped greatly.
  • The team put in a bloody hard shift.
  • There is now a very solid understanding of the SVN service across at least 4 members of the team, as opposed to 2 x 0.5 understandings before.
  • A much broader insight into the current design of our infrastructure was gained by the newer members of the team. 

Wednesday Dec 03, 2014

Subversion master undergoing emergency maintenance



Friday Nov 21, 2014

MoinMoin Service - User Account Tidy Up

In recent months we have become increasingly aware of a slowing down of our MoinMoin wiki service. We have attributed this, at least in part, to the way MoinMoin stores some data about user accounts.

Across all of our wiki instances (in the farm) we had a little over 1.08 million distinct user accounts, many of which have never been used (spam, etc.). So we have decided to archive all users who have not accessed any of the wiki sites they were registered for in more than 128 days.

This has resulted in us being able to archive a little over 800k users, leaving us with around 200k users across 77 wikis. This still feels very high, and in the coming weeks we will investigate further how we can better understand whether those remaining accounts are making valid changes, or are just link-farm home pages.

If you think your account was affected by this and you would like to have it restored, then please contact the Infra team using this page: http://www.apache.org/dev/infra-contact


Thanks,
ASF Infra Team

Monday Oct 06, 2014

Code signing service now available

The ASF Infrastructure team is pleased to announce the availability of a new code signing service for Java, Windows and Android applications. This service is available for any Apache project to use to sign its releases. Traditionally, Apache projects have shipped source code. The code tarballs are signed with a GPG signature to allow users and providers to verify the code's authenticity; users have then either compiled their own applications, or some projects have provided convenience binaries. With projects like Apache OpenOffice, users expect to receive binaries that are ready to run. Today's desktop and mobile operating systems expect that binaries will be signed by the vendor - which had left a gap to be filled for Apache projects.
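
That GPG verification step looks like this for a typical source release (the file
names here are illustrative):

    # verify a source release against its detached GPG signature
    gpg --import KEYS                # the project's published signing keys
    gpg --verify apache-foo-1.0-src.tar.gz.asc apache-foo-1.0-src.tar.gz
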

After a great deal of research, we have chosen Symantec's Secure App Service offering to provide the code signing service. This allows us to permit access granularly, and each PMC will have their own certificate(s) for signing. The per-project nature of certificate issuance allows us to revoke a signature without disrupting other projects.

This service will permit projects to sign artifacts either via a web GUI or a SOAP API. In addition, a Java client and an Ant task for signing have been written, and a Maven plugin is under development.

This service results in a 'pay for what you use' scenario, so PMCs are asked to use the service responsibly. To that end, projects will have access to a test environment to ensure that they have their process working correctly before consuming actual credits.

Thus far, two projects have helped test this and work out the process, for which we are very grateful. Those projects, Commons and Tomcat, have both successfully released signed artifacts recently (Commons Daemon 1.0.15 and Tomcat 8.0.14).

Projects that wish to use this service should open an Infra JIRA ticket under the Codesigning component.

Thursday Oct 02, 2014

GitHub pull request builds now available on builds.apache.org

The ASF Infrastructure team is happy to announce that you can now set up jobs on builds.apache.org to listen for pull requests to github.com/apache repositories, build that pull request’s changes, and then comment on the pull request with the build’s results. This is done using the Jenkins Enterprise GitHub pull request builder plugin, generously provided to the ASF by our friends at CloudBees. We've set up the necessary hooks on all github.com/apache repositories that are up as of Wednesday, Oct 1, 2014, and will be adding the hooks to all new repositories going forward.

Here’s what you need to do to set it up:

  • Create a new job, probably copied from an existing job.
  • Make sure you’re not doing any “mvn deploy” or equivalent in the new job - this job shouldn’t be deploying any artifacts to Nexus, etc.
  • Check the “Enable Git validated merge support” box - you can leave the first few fields set to their default, since we’re not actually pushing anything. This is just required to get the pull request builder to register correctly.
  • Set the “GitHub project” field to the HTTP URL for your repository - i.e., "http://github.com/apache/incubator-brooklyn/" - make sure it ends with that trailing slash and doesn’t include .git, etc.
  • In the Git SCM section of the job configuration, set the repository URL to point to the GitHub git:// URL for your repository - i.e., git://github.com/apache/incubator-brooklyn.git.
  • You should be able to leave the “Branches to build” field as is - this won’t be relevant anyway.
  • Click the “Add” button in “Additional Behaviors” and choose "Strategy for choosing what to build”. Make sure the choosing strategy is set to “Build commits submitted for validated merge”.
  • Uncheck any existing build triggers - this shouldn’t be running on a schedule, polling, running when SNAPSHOT dependencies are built, etc.
  • Check the “Build pull requests to the repository” option in the build triggers.
  • Optionally change anything else in the job that you’d like to be different for a pull request build than for a normal build - i.e., any downstream build triggers should probably be removed, you may want to change email recipients, etc.
  • Save, and you’re done!

Now when a pull request is opened or new changes are pushed to an existing pull request to your repository, this job will be triggered, and it will build the pull request. A link will be added to the pull request in the list of builds for the job, and when the build completes, Jenkins will comment on the pull request with the build result and a link to the build at builds.apache.org

In addition, you can also use the "Build when a change is pushed to GitHub" option in the build triggers for non-pull request jobs, instead of polling - Jenkins receives notifications from GitHub whenever one of our repositories has been pushed to. Jenkins can then determine which jobs use that repository and the branch that was pushed to, and trigger the appropriate build.

If you have any questions or problems, please email builds@apache.org or open a BUILDS JIRA at issues.apache.org
