Apache Infrastructure Team

Thursday June 30, 2016

ASF JIRA Outages and Troubleshooting

As people have noticed, our JIRA instance (arguably the largest public instance in the world) has recently been suffering from an as-yet-unidentified issue. We are reasonably sure that it is related to specific queries being made against the instance (possibly automated queries from scrapers), but we have yet to identify the exact cause of the problem.

The failure condition arises when the database connection pool is exhausted, even though the pool is configured and sized appropriately. The connections all appear idle, but once the pool is full, no new connections can be established, and the instance falls over, requiring a restart.
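To make the failure mode concrete, here is a minimal sketch of a fixed-size pool whose connections are all checked out but doing no work. It is a toy model only: the BoundedPool class, the connection names, and the timeout values are hypothetical and are not taken from JIRA, Tomcat, or our actual configuration.

```python
import queue


class BoundedPool:
    """Toy fixed-size connection pool (hypothetical, not JIRA's real pool)."""

    def __init__(self, size):
        self._free = queue.Queue(maxsize=size)
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)


def demo(size=3):
    pool = BoundedPool(size)

    # Every connection is checked out and then sits idle: held, never returned.
    held = [pool.acquire(timeout=0.1) for _ in range(size)]

    # A new request cannot get a connection, even though none is doing work.
    try:
        pool.acquire(timeout=0.1)
        exhausted = False
    except queue.Empty:
        exhausted = True

    # Returning a single connection is enough for new requests to proceed.
    pool.release(held.pop())
    recovered = pool.acquire(timeout=0.1)
    return exhausted, recovered


if __name__ == "__main__":
    print(demo())
```

In this model the pool size is not the problem: the new request fails only because nothing ever returns a connection, which matches the symptom of a full pool of apparently idle connections.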

We are working closely with Atlassian, the creator of JIRA, to remedy the situation. Unfortunately, this requires running diagnostics on the production JIRA instance, which in and of itself causes performance degradation and downtime. Over the past several days, we've identified and implemented some changes to the pool parameters which we hope will help stabilize the instance while we continue our diagnostic work.

We expect that there may still be some moments of downtime and occasional restarts. Any longer-duration outages will be announced via Twitter/infrabot and status.apache.org.


If you are using multiple "Create Sub-tasks" in workflow transitions and are using JIRA Software (Jira 7 + Agile), then most likely you are seeing this bug -- https://jira.atlassian.com/browse/JSW-13756 -- which is a race-type condition that can happen when two near-simultaneous transactions go through the same workflow where sub-tasks are created. The first request puts a lock on the Lexorank table, processes its first transaction and then gets back in line to run its next transaction. While this is happening, the second transaction gets in line to use the Lexorank table, but cannot until the first transaction releases the row locks. Since the first transaction is now behind the second transaction in line, we are deadlocked. Any subsequent requests to use the Lexorank table will continue to chew through the database connections until they are gone, which in turn will also cause the Tomcat thread pool to deplete until JIRA is no longer responsive at all.

Posted by Chris Solgat on June 30, 2016 at 05:38 PM UTC #

Thanks, Chris, for your comments. We are currently behind on releases and are using 6.3.4, which predates the split of products.

Posted by Gavin McDonald on July 02, 2016 at 12:46 AM UTC #
