Apache Infrastructure Team

Thursday March 04, 2010

New slaves for ASF Buildbot

The ASF Buildbot CI instance has just launched two more slaves, expanding the range of platforms it can build and test on.

Monday February 22, 2010

The ASF LDAP system

When we decided some time ago to start using LDAP for authentication and authorisation (auth{n,z}), we had to come up with a sane structure. This is what we have so far:

 dc=apache,dc=org
      | ---  ou=people,dc=apache,dc=org
      | ---  ou=groups,dc=apache,dc=org
           | ---  ou=people,ou=groups,dc=apache,dc=org
           | ---  ou=committees,ou=groups,dc=apache,dc=org

There are also other OUs that contain infrastructure-related objects.

So with "dc=apache,dc=org" being our base DN, we decided we needed to keep the structure as simple as possible, and placed the following objects in the respective OUs (a sample entry is sketched below the list):

  • User accounts -  "ou=people,dc=apache,dc=org"
  • POSIX groups - "ou=groups,dc=apache,dc=org"
  • User Groups  - "ou=people,ou=groups,dc=apache,dc=org"
  • PMC/Committee groups - "ou=committees,ou=groups,dc=apache,dc=org"
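For a concrete picture, a committee group entry in this layout might look roughly like the following LDIF. This is an illustrative sketch only; the project name and member are made up, and the exact objectClasses of our real entries may differ.

    dn: cn=someproject,ou=committees,ou=groups,dc=apache,dc=org
    objectClass: groupOfNames
    cn: someproject
    member: uid=jdoe,ou=people,dc=apache,dc=org
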
Access to the LDAP infrastructure is limited to connections from hosts within our co-location sites.  This is essentially to help prevent unauthorised data from leaving our network.

LDAP, groups and SVN - Coupled together

The infrastructure team have now completed the next stage of the planned LDAP migration.
We have migrated our old SVN authorisation file and POSIX groups into LDAP, and SVN access control is now managed using these groups.

This means that changing access to the Subversion repositories is now as simple as changing group membership. We use some custom Perl scripts that build the equivalent authorisation file from LDAP, which means we no longer need the nasty <Location>-block hack to do this.  It also means that all changes, including adding new groups and extending access control, are simple to make.
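For the curious, here is a rough idea of what such a generator can look like. This is a minimal Python sketch using the python-ldap library, not our actual Perl tooling; the LDAP URI, output path, and group attribute are assumptions for illustration.

    import ldap

    # Illustrative values only -- not our real URIs or paths.
    LDAP_URI = "ldaps://ldap.example.org"
    BASE = "ou=committees,ou=groups,dc=apache,dc=org"
    AUTHZ_FILE = "/tmp/svn-authz"

    conn = ldap.initialize(LDAP_URI)
    conn.simple_bind_s()  # anonymous bind; real tooling would authenticate

    # Assume each committee group lists its members in a 'member' attribute.
    results = conn.search_s(BASE, ldap.SCOPE_ONELEVEL, "(cn=*)", ["cn", "member"])

    with open(AUTHZ_FILE, "w") as out:
        out.write("[groups]\n")
        for dn, attrs in results:
            name = attrs["cn"][0].decode()  # python-ldap 3.x returns bytes
            # Member DNs look like uid=jdoe,ou=people,dc=apache,dc=org;
            # extract the uid part for the authz group line.
            uids = [m.decode().split(",")[0].split("=")[1]
                    for m in attrs.get("member", [])]
            out.write("%s = %s\n" % (name, ", ".join(uids)))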

ASF PMC chairs are now able to make changes to their POSIX and SVN groups whilst logged into people.apache.org, using a selection of scripts:

  • /usr/local/bin/list_unix_groups.pl
  • /usr/local/bin/list_committees.pl
  • /usr/local/bin/modify_unix_groups.pl
  • /usr/local/bin/modify_committees.pl

All of these scripts have a '--help' option to show you how to use them.

What's next?  We are now working on adding a custom ASF LDAP schema that will allow us to record ASF-specific data such as ICLA files, date of membership, and so on.
We will also be looking at adding support for third-party applications such as Hudson, and at building an identity management portal where people can manage their own accounts.

Wednesday February 17, 2010

SVN performance enhancements

Tonight we enabled a pair of Intel X25-M SSDs to serve as an L2ARC cache for the ZFS array which contains all of our svn repositories.  Over the next few hours, as these SSDs start serving files from cache, the responsiveness and overall performance of svn on eris (our US-based master server) should be noticeably better than it has been lately.

In addition, we are planning to install 16GB of extra RAM in eris to improve ZFS performance even further, but for now we are hopeful that committers will appreciate the performance boost we've added tonight.
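For those curious about the mechanics, adding an L2ARC device to an existing pool is a one-line operation. A minimal sketch, assuming a pool named "tank" and hypothetical device names:

    # Attach two SSDs as L2ARC (read cache) devices to the pool.
    zpool add tank cache ad4 ad6
    # Watch cache hit rates as the devices warm up.
    zpool iostat -v tank 5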



Monday November 09, 2009

What can the ASF Buildbot do for your project?

The information below has just been published at the main ASF Buildbot page, ci.apache.org/buildbot.html.

A summary of just some of the things the ASF Buildbot can do for your project (a sketch of a typical per-commit builder configuration follows the list):

  • Perform per commit build & test runs for your project
  • Not just svn! - Buildbot can pull in from your Git/Mercurial branches too!
  • Build and Deploy your website to a staging area for review
  • Build and Deploy your website to minotaur (people.apache.org) for syncing live
  • Automatically Build and Deploy Snapshots to Nexus staging area.
  • Create Nightly and historical zipped/tarred snapshot builds for download
  • Builds can be triggered manually from within your own Freenode IRC channel
  • An IRCBot can report on success/failures of a build instantly
  • Build Success/Failures can go to your dev/notification mailing list
  • Perform multiple builds of an svn/git commit on multiple platforms asynchronously
  • ASF Buildbot uses the latest RAT build to check for license header issues for all your files.
  • RAT Reports are published live instantly to ci.apache.org/$project/rat-report.[txt|html]
  • As indicated above, plain text or html versions of RAT reports are published.
  • [Coming Soon] - RAT Reports sent to your dev list, only new failures will be listed.
  • [Coming Soon] - Email a patch with inserted ASL 2.0 Headers into your failed files!!
  • Currently Buildbot has Ubuntu 8.04, 9.04 and Windows Server 2008 Slaves
  • [Coming Soon] - ASF Buildbot will soon have Solaris, FreeBSD 8 and Windows 7 Slaves
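
To give a flavour of how a project gets built, here is roughly what a per-commit builder looks like in a Buildbot master.cfg. This is an illustrative sketch only; the repository URL, slave name, and build command are placeholders rather than a real project's configuration.

    from buildbot.process import factory
    from buildbot.steps.source import SVN
    from buildbot.steps.shell import ShellCommand

    # Check the project out of svn and run its test suite on every commit.
    f = factory.BuildFactory()
    f.addStep(SVN(svnurl="https://svn.apache.org/repos/asf/someproject/trunk",
                  mode="update"))
    f.addStep(ShellCommand(command=["./build.sh", "test"],
                           description="running tests"))

    c['builders'] = [{
        'name': 'someproject-trunk',
        'slavename': 'ubuntu1',   # one of the slaves listed above
        'builddir': 'someproject-trunk',
        'factory': f,
    }]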

Don't see a feature that you need? Join the builds.at.apache.org mailing list and request it now, or file a Jira ticket.

Help is always on hand on the builds.at.apache.org mailing list for any problems or build configuration issues/requests. Or try the #asftest channel on irc.freenode.net for live support.

So now you want your project to use Buildbot? No problem: the best way is to file a Jira ticket and count to 10 (well, maybe a bit longer, but it won't be long before you are up and running).

Monday October 12, 2009

DDOS mystery involving Linux and mod_ssl

In the first week of October we started getting reports of performance issues, mainly connection timeouts, on all of our services hosted at https://issues.apache.org/.  On further inspection we noticed a huge number of "Browser disconnect" errors in the error log, right at the beginning of the SSL transaction, on the order of 50 connections/second.  This was grinding the machine to a standstill, so we wrote a quick and dirty Perl script to investigate the matter.  Initial reports indicated a DDoS attack from nearly 100K machines targeting Apache + mod_ssl's accept loop, and the script was tweaked to filter out that traffic before proxying the connections to httpd.

As we started getting a picture of the IP space conducting the attack, the prognosis looked rather bleak: more and more IPs were getting involved and the DDoS traffic continued to increase, to the point where Linux was shutting down the ethernet interface.  So we rerouted the traffic to an available FreeBSD machine, which did a stellar job of filtering out the traffic at the kernel level.  We unfortunately didn't quite realize how good a job FreeBSD was doing, and for a time we were operating under the impression that the DDoS was ending.  So we eventually moved the traffic back to brutus, the original Linux host, and patched httpd using code developed by Ruediger Pluem.

And back came the DDoS traffic.  In a few days the rate of closed connections had nearly doubled, so we had little choice but to start dumping the most frequent IP addresses into iptables DROP rules.  5000 rules cut the traffic by two-thirds in an instant.  But the problem was growing: our logs indicated there were now over 300K addresses participating in the attack.
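Each of those rules is as simple as the line below; the address shown here is from a documentation range, not a real attacker.

    # Drop all traffic from one offending address.
    iptables -A INPUT -s 203.0.113.45 -j DROP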

We started looking closer at the IPs in an attempt to correlate them with regular HTTP requests.  The only pattern that seemed to emerge was that many of the IPs in question were also generating spartan "GET / HTTP/1.1" requests, with a single Host: 140.211.11.140 header, to port 443.  Backtracking through a year of logs revealed that these spartan requests had been going on since August 6, 2008.  The IPs originating these requests were as varied as, and more often than not matched up with, the rapid closed-connection traffic we started seeing in October.
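In other words, each of these requests looked like this on the wire, with no other headers at all:

    GET / HTTP/1.1
    Host: 140.211.11.140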

So what exactly is going on here?  The closed connection traffic continues to rise, and the origin of the associated spartan requests is currently unknown.

Wednesday September 02, 2009

apache.org incident report for 8/28/2009

Last week we posted about the security breach that caused us to temporarily suspend some services.  All services have now been restored. We have analyzed the events that led to the breach, and continued to work on improving the security of our systems.

NOTE: At no time were any Apache Software Foundation code repositories, downloads, or users put at risk by this intrusion. However, we believe that providing a detailed account of what happened will make the internet a better place, by allowing others to learn from our mistakes.

What Happened?

Our initial running theory was correct--the server that hosted the apachecon.com (dv35.apachecon.com) website had been compromised. The machine was running CentOS, and we suspect the attackers may have used the recent local root exploits patched in RHSA-2009:1222 to escalate their privileges on this machine. The attackers fully compromised this machine, including gaining root privileges, and destroyed most of the logs, making it difficult for us to confirm the details of everything that happened on the machine.

This machine is owned by the ApacheCon conference production company, not by the Apache Software Foundation. However, members of the ASF infrastructure team had accounts on this machine, including one used to create backups.

The attackers attempted unsuccessfully to use passwords from the compromised ApacheCon host to log on to our production webservers.  Later, using the SSH Key of the backup account, they were able to access people.apache.org (minotaur.apache.org). This account was an unprivileged user, used to create backups from the ApacheCon host.

minotaur.apache.org runs FreeBSD 7-STABLE, and acts as the staging machine for our mirror network. It is our primary shell account server, and provides many other services for Apache developers. None of our Subversion (version control) data is kept on this machine, and there was never any risk to any Apache source code.

Once the attackers had gained shell access, they added CGI scripts to the document root folders of several of our websites. A regular, scheduled rsync process copied these scripts to our production web server, eos.apache.org, where they became externally visible. The CGI scripts were used to obtain remote shells, with information sent using HTTP POST commands.

Our download pages are dynamically generated, to enable us to present users with a local mirror of our software. This means that all of our domains have ExecCGI enabled, making it harder for us to protect against an attack of this nature.
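In httpd terms, every virtual host effectively carries something like the configuration below (paths illustrative). Any file with a CGI extension under such a directory gets executed on request -- including one an attacker manages to drop there.

    <Directory "/var/www/site/htdocs">
        Options +ExecCGI
        AddHandler cgi-script .cgi
    </Directory>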

After discovering the CGI scripts, the infrastructure team decided to shut down any servers that could potentially have been affected. This included people.apache.org and both the EU and US website servers. All website traffic was redirected to a known-good server, and a temporary security message was put in place to let people know we were aware of an issue.

One by one, we brought the potentially-affected servers up in single-user mode, using our out-of-band access. It quickly became clear that aurora.apache.org, the EU website server, had not been affected. Although the CGI scripts had been rsync'd to that machine, they had never been run. This machine was not included in the DNS rotation at the time of the attack.

aurora.apache.org runs Solaris 10, and we were able to restore the box to a known-good configuration by cloning and promoting a ZFS snapshot from a day before the CGI scripts were synced over. Doing so enabled us to bring the EU server back online, and to rapidly restore our main websites. Thereafter, we continued to analyze the cause of the breach, the method of access, and which, if any, other machines had been compromised.

Shortly after bringing aurora.apache.org back up, we determined that the most likely route of the breach was the backup routine from dv35.apachecon.com. We grabbed all the available logs from dv35.apachecon.com and promptly shut it down.

Analysis continued on minotaur.apache.org and eos.apache.org (our US server) until we were confident that all remnants of the attackers had been removed. As each server was declared clean, it was brought back online.

What worked?

  • The use of ZFS snapshots enabled us to restore the EU production web server to a known-good state.
  • Redundant services in two locations allowed us to run services from an alternate location while continuing to work on the affected servers and services.
  • A non-uniform set of compromised machines (Linux/CentOS i386, FreeBSD 7 amd64, and Solaris 10 on SPARC) made it difficult for the attackers to escalate privileges on multiple machines.

What didn't work?

  • The use of SSH keys facilitated this attack. In hindsight, our implementation left a lot to be desired--we did not restrict SSH keys appropriately, and we were unaware of their misuse.
  • The rsync setup, which uses people.apache.org to manage the deployment of our websites, enabled the attackers to get their files onto the US mirror, undetected.
  • The ability to run CGI scripts in any virtual host, when most of our websites do not need this functionality, made us unnecessarily vulnerable to an attack of this nature.
  • The lack of logs from the ApacheCon host prevents us from conclusively determining the full course of action taken by the attackers. All but one log file were deleted by the attackers, and logs were not kept off the machine.

What changes are we making now?

As a result of this intrusion we are making several changes to help further secure our infrastructure against such issues in the future. These changes include the following:
  • Requiring all users with elevated privileges to use OPIE for sudo on certain machines.  We already require this in some places, but will expand its use as necessary.
  • Recreating and using new SSH keys, one per host, for backups.  We are also enforcing the use of the from="" and command="" options in the authorized_keys file on the destination backup server; an illustrative entry is sketched after this list. In tandem with access restrictions which only allow connections from machines that are actually backing up data, this will prevent third-party machines from being able to establish an SSH connection.
    • The command="" option in the authorized_keys file is now explicit, and only allows one-way rsync traffic, due to the paths and flags used.
    • New keys have been generated for all hosts, with a key length of at least 4096 bits.
  • The VM that hosted the old apachecon.com site remains powered down, awaiting further detailed analysis.  The apachecon.com website has been re-deployed on a new VM, with a new provider and different operating system.
  • We are looking at disabling CGI support on most of our website systems.  This has led to the creation of a new httpd module that will handle things like mirror locations for downloads.
  • The method by which most of our public facing websites are deployed to our production servers will also change, becoming a much more automated process. We hope to have switched over to a SvnSubPub / SvnWcSub based system within the next few weeks.
  • We will re-implement measures such as IP banning after several failed logins, on all machines. 
  • A proposal has been made to introduce centralized logging. This would include all system logs, and possibly also services such as smtpd and httpd.
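
As an illustration of the restricted-key setup described above, an authorized_keys entry along these lines ties a key to one client host and one fixed command. The host name, path, and rsync flags here are made up; the real values depend on the exact rsync invocation being allowed.

    from="backup-client.example.org",command="rsync --server --sender . /export/backup/",no-pty,no-port-forwarding ssh-rsa AAAA... backup-key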



Friday August 28, 2009

apache.org downtime - initial report

This is a short overview of what happened on Friday August 28 2009 to the apache.org services.  A more detailed post will come at a later time after we complete the audit of all machines involved.

On August 27th, starting at about 18:00 UTC, an account used for automated backups for the ApacheCon website, hosted on a third-party hosting provider, was used to upload files to minotaur.apache.org.  The account was accessed using SSH key authentication from this host.

To the best of our knowledge at this time, no end users were affected by this incident,  and the attackers were not able to escalate their privileges on any machines.

While we have no evidence that downloads were affected, users are always advised to check digital signatures where provided.

minotaur.apache.org runs FreeBSD 7-STABLE and is more widely known as people.apache.org.  Minotaur serves as the seed host for most apache.org websites, in addition to providing shell accounts for all Apache committers.

The attackers created several files in the directory containing the files for www.apache.org, including several CGI scripts.  These files were then rsynced to our production webservers by automated processes.  At about 07:00 UTC on August 28 2009 the attackers accessed these CGI scripts over HTTP, which spawned processes on our production web servers.

At about 07:45 UTC we noticed these rogue processes on eos.apache.org, the Solaris 10 machine that normally serves our websites.

Within the next 10 minutes we decided to shut down all machines involved as a precaution.

After an initial investigation, we changed DNS for most apache.org services to point to eris.apache.org, a machine that was not affected, and provided a basic downtime message.

After investigation, we determined that our European failover and backup machine, aurora.apache.org, was not affected.  While some files had been copied to the machine by automated rsync processes, none of them were executed on the host, and we restored all our websites from a ZFS snapshot taken before any accounts were compromised.

At this time several machines remain offline, but most user-facing websites and services are now available.

We will provide more information as we can.


Saturday August 01, 2009

Relaying mail from apache.org.

One of the more common issues committers face at Apache is trying to send mail from their apache.org account.  We've just made that process a whole lot easier by setting up an SSL-enabled, SMTP-AUTH based mail submission service on people.apache.org port 465, which is compatible with Gmail's recently announced feature allowing outbound mail from your apache.org address to be directed to people.apache.org, instead of to a Gmail server, for delivery.  Say goodbye to all the ezmlm moderation battles: your SMTP envelope sender will now match your From header!
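If you want to test the service by hand, something like the following Python sketch should work; the availid, password, and recipient are placeholders for your own values.

    import smtplib
    from email.mime.text import MIMEText

    USER, PASSWORD = "availid", "secret"  # placeholders

    msg = MIMEText("Testing the new submission service.")
    msg["From"] = "availid@apache.org"
    msg["To"] = "someone@example.org"
    msg["Subject"] = "smtp-auth test"

    # Port 465 speaks SSL from the first byte, hence SMTP_SSL.
    server = smtplib.SMTP_SSL("people.apache.org", 465)
    server.login(USER, PASSWORD)
    server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    server.quit()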

In the future we may wish to tighten up the SPF records for apache.org, so please take advantage of this new service for all outbound delivery of your personal apache.org email.

Wednesday July 15, 2009

Public Preview of Drafts feature added to ASF Roller instance

Previously, to preview a draft post on any Roller blog, one had to be a member user of that blog.

For those who would like an easy way to post previews of drafts for lazy consensus or voting, a script has been set up that allows the preview URL Roller generates to be shared publicly.  For example:

   (roller preview url)
    https://blogs.apache.org/roller-ui/authoring/preview/test/?previewEntry=testing

   (public preview url)
    https://blogs.apache.org/preview/test/?previewEntry=testing

A typical process is to create the blog post, set it to publish in 3-4 days via the "Advanced Settings", then post the modified preview URL to your dev@ list with the anticipated publish date for lazy consensus.

Projects must opt-in by adding the "preview" user with "Limited" access.

Details here:

http://www.apache.org/dev/blogs.html 

Tuesday July 07, 2009

Confluence 2.10 migration for cwiki.a.o 11 July

The ASF Infrastructure Team will be upgrading the Confluence instance powering http://cwiki.apache.org from Confluence 2.2.9 to Confluence 2.10.3 on July 11 at 0400 UTC (July 10 at 2100 PDT).  The migration is expected to take several hours.

If you haven't already, this would be a good time to check the test migration instance at:

http://confluence-test.zones.apache.org:8080

Exported pages can be found at http://confluence-test.zones.apache.org:8080/export/SPACE_KEY/PAGE_TITLE.html.  If in doubt, start from your existing exported pages at http://cwiki.apache.org/.  For example:

http://cwiki.apache.org/WW/home.html

will become

http://confluence-test.zones.apache.org:8080/export/WW/home.html

As much as possible, the space export templates will be preserved in the migration, although changes to the Confluence UI will mean the exports will look different.

Further updates regarding the Confluence 2.10.3 migration will be posted to this blog.

Update 11-07-2009

The Confluence 2.10.3 upgrade has been completed and all spaces have been exported.  There are a few things to note:

  1. The Gliffy license is out of date.  I'll try to track down a new one.
  2. The visibility plugin doesn't support Confluence 2.10.3.  Not sure if anyone uses it, however.
  3. The exported html, as warned, generally looks a bit different.  Let me know if you have any issues tweaking your template.

Update 11-07-2009 part 2

If, for some reason, your templates didn't get copied over or the exported site is so messed up you need the old version, the old files are available:

Update 14-07-2009

The Gliffy folks were kind enough to give us a new license.  Please re-export any applicable spaces.

Thursday May 21, 2009

It's official, we now have LDAP running!

Earlier this week the Infrastructure team rolled out phase one of the planned LDAP services.  

We are using LDAP for authentication of shell accounts.  For now this is the extent of the implementation; however, the next phase should follow quite quickly.

The next phase will involve moving to LDAP to manage access to our Subversion repositories. This is a slightly more complicated migration, as we currently use an SVN authz file that contains the appropriate groups and their memberships.  We are working on a new template system whereby changes in LDAP will trigger a build of the authz file based on the groups in LDAP.  This means we must watch for LDAP changes, and also trigger a rebuild whenever a new version of the template is checked into Subversion.  This is a work in progress at the moment.
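For readers unfamiliar with it, the generated file follows Subversion's standard authz format, along these lines (the group and path names are made up for illustration):

    [groups]
    someproject-committers = jdoe, rbloggs

    [/someproject]
    @someproject-committers = rw
    * = r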

If you find yourself needing to change your shell account password, you can do so on the command line with "ldappasswd -W -S -A -D uid=availid,ou=people,dc=apache,dc=org", where availid is your ASF username.  For example: "ldappasswd -W -S -A -D uid=pctony,ou=people,dc=apache,dc=org".  This is far from an elegant solution, but for now it works.  You will be prompted to enter and re-enter your current password, then to enter and confirm your new password, and finally for your LDAP bind password (which is your old password again).

We are working on a web portal that will allow users to edit attributes such as forwarding address, password, and so on.  This will be made available as soon as it is ready.  If you don't know your current password, you will need to email root@ as usual.

You can follow the trials and tribulations of the rollout on my personal blog.

Sunday May 03, 2009

Git support at Apache

Git is a new version control system that has been getting increasingly popular during the past few years. Many Apache contributors have also expressed interest in using Git for working with Apache codebases. While the canonical location of all Apache source code is our Subversion repository, we also want to support developers who prefer to use Git as their version control tool.

Based on work by volunteers on the infrastructure-dev@ mailing list, we have recently set up read-only Git mirrors of many Apache codebases at http://git.apache.org/. These mirrors contain the full version histories (including all branches and tags) of the mirrored codebases and are updated in near real time based on the latest svn commits.
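Using a mirror is just an ordinary clone. The project name below is a placeholder; see git.apache.org for the actual list of mirrored codebases.

    git clone git://git.apache.org/someproject.git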

See the documentation and wiki pages for more details about this service and how to best use it. We are also open to good ideas on how to extend or improve this service. Please join the infrastructure-dev@ mailing list for the ongoing discussion!

Monday April 06, 2009

New mailing list for CI Build Services

Established today, we now have a dedicated mailing list to talk about and work out all things to do with our build services. Currently infrastructure provides projects with the use of Hudson, Continuum and Gump, and now we have another option in Buildbot. Buildbot is a new service here at Apache Infrastructure, currently in its last stages of testing; more info is coming soon.

All of these services, and all the projects that use them, are welcome to meet together on the new mailing list. Maybe your project is looking to use one or more of these CIs to build and test its code, build its site, or publish to Nexus. Maybe you are already using a CI and want some configuration additions or changes, or extra jobs run.

Also look out for us poor souls looking after these instances and the machines they run on: we might need more information from your projects, clarification or updates of build requirements, or to investigate builds that are taking too long.

Failing builds are of course for each project to solve code-wise, but be sure that whichever CI(s) you choose, they are there to inform and will give you constant reminders of build failures ;)

Sign up to the new mailing list: builds-subscribe-AT-apache-DOT-org.

 See you there! 


Thursday April 02, 2009

Improving our Subversion Services

This week the ASF Infrastructure Team deployed one of the first major changes to how svn.apache.org works since it was launched six years ago.

We now distribute Subversion traffic to our servers based on the geographic region of a client.

We are using pgeodns, the same software that powers CPAN Search and the NTP Pool.  With pgeodns we can give out different DNS answers to clients, depending on where they are connecting from.  It isn't an exact science, but for most clients it is good enough to find the closer of the two Subversion servers.

If you are connecting from Europe, your client will connect to Harmonia.  Harmonia is a Sun x4150 running FreeBSD 7.0, using ZFS raidz2 over 6 disks, hosted in Amsterdam at SURFnet.

Users in North America are directed to Eris, our traditional Subversion master server.  Eris is a Dell 2950, also running FreeBSD 7.0, using ZFS raidz2 over 4 disks, hosted in Corvallis, Oregon at OSUOSL.

Using svnsync, as described in Norman's ApacheCon EU 2009 talk, we replicate all commits on the master to the slave in real time.  If a commit is made to the slave, we proxy it to the master.
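For those who haven't used it, svnsync mirrors one repository into another over ordinary repository URLs. A rough sketch of the setup, with illustrative URLs rather than our actual internal ones:

    # One-time initialisation: point the (empty) mirror at the master.
    svnsync init https://slave.example.org/repos/asf https://master.example.org/repos/asf

    # Replay new revisions; typically driven from the master's post-commit hook.
    svnsync sync https://slave.example.org/repos/asf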

Read operations are handled by the nearest mirror and, thanks to the decreased latency, are much faster for everything from the initial checkout to running an update.

While this change should improve the experience significantly, we have some other changes coming up soon for svn.apache.org:

  • Upgrade to Subversion 1.6: representation sharing, filesystem packing, and memcached support should help make our SVN servers even faster.
  • Upgrade both Eris and Harmonia to FreeBSD 7.2-STABLE: The ZFS filesystem is experimental in FreeBSD 7, and there are many stability and performance enhancements available in newer versions.
  • Adding more Geographic Mirrors: Once we are comfortable with the current setup, we would like to expand to another mirror location, hopefully in Australia or Asia. 
