By Michael Russo
Based on recent testing at Apigee, the upcoming Apache Usergrid 2 release is set to be the most scalable open-source Backend as a Service available. We were able to drive Usergrid to 10,000 transactions per second and, more importantly, found that Usergrid can scale horizontally. Here's the story of how we got there.
Apache Usergrid is a software stack that enables you to run a (Backend-as-a-Service) BaaS that can store, index, and query JSON objects. It also enables you to manage assets and provide authentication, push notifications, and a host of other features useful to developers—especially those working on mobile apps.
The project recently graduated from the Apache Incubator and is now a top-level project of the Apache Software Foundation (ASF). Usergrid is new at Apache, but Apigee has been using it in production for three years now as the foundation for Apigee's API BaaS product.
Usergrid 2 features the same REST API as Usergrid 1, but under the hood just about everything has changed. Usergrid 2 includes a completely new persistence engine backed by Apache Cassandra and a query/indexing service backed by ElasticSearch. We brought ElasticSearch into Usergrid because the query/index service in Usergrid 1 was not performing well and was complex and difficult to maintain. ElasticSearch does a much better job of query/index than we could have done ourselves. Additionally, separating key-value persistence from index/query allows us to scale each concern separately.
As the architecture of Usergrid changed drastically, we needed to have a new baseline performance benchmark to make sure the system scaled as well as, if not better than, it did before. Let's talk about how we tested.
The Usergrid team has invested a lot of time building repeatable test cases using the Gatling load-testing framework. Performance is a high priority for us and we need a way to validate performance metrics for every release candidate.
As Usergrid is open source, so are our Usergrid-specific Gatling scenarios, which you can find here: stack/loadtests (on Github).
One of our goals was to prove that we had the ability to scale more requests per second with more hardware, so we started small and worked our way up.
As the first in our series of new benchmarking for Usergrid, we wanted to start with a trivial use case to establish a solid baseline for the application. All testing scenarios use the HTTP API and test the concurrency and performance of the requests. We inserted a few million entities that we could later read from the system. The test case itself was simple. Each entity has a UUID (universally unique identifier) property. For all the entities we had inserted, we randomly read them out by their UUID:
First, we tried scaling the Usergrid application by its configuration. We configured a higher number of connections to use for Cassandra and a higher number of threads for Tomcat to use. This actually yielded higher latencies and system resource usage for marginally the same throughput. We saw better throughput when there was less concurrency allowed. This made sense, but we needed more, and immediately added more Usergrid servers to verify horizontal scalability. What will it take to get to 10,000 RPS?
|# Usergrid Servers||# Cassandra Nodes||Peak Requests Per Second|
|Switch to nine c3.2xlarge instances for Cassandra|
|Switch to nine c3.4xlarge instances for Cassandra|
It was time to see if Cassandra was keeping up. As we scaled up the load we found Cassandra read operation latencies were also increasing. Shouldn't Cassandra handle more, though? We observed a single Usergrid read by UUID was translating to about 10 read operations to cassandra. Optimization #1: reduce the number of read operations from Cassandra on our most trivial use case. Given what we know, we still decided to test up to a peak 10,000 RPS in the current state.
The cluster was scaled horizontally (more nodes) until we needed to vertically scale (bigger nodes) Cassandra due to high CPU usage. We stopped at 10,268 Requests Per Second with 35 c3.xlarge Usergrid servers and 9 c3.4xlarge Cassandra nodes. By this point numerous opportunities for improvement were identified in the codebase, and we had already executed on some. We fully expect to reach the same throughput with much less infrastructure in the coming weeks. In fact, we've already reached ~7,800 RPS with only 15 Usergrid servers since our benchmarking.
Here are the components that we used in our Usergrid performance testing:
As part of benchmarking, we wanted to ensure that all configurations and deployment scenarios exactly matched how we would run a production cluster. The main configurations that are recommended for production use of Usergrid are:
|Usergrid (Application)||Tomcat (Container)||Cassandra (Database)|
|1||LOCAL QUORUM read and write consistency set for Cassandra operations||Blocking IO connector ( required in Usergrid)||6+ node cluster size|
|2||Configure separate Keyspace used for Locks vs. Main Usergrid application||Use HTTP 1.1 and ensure keepAlive is configured|
|3||Configure max # of connections used per Cassandra node to be 15||Non-SSL connector (SSL typically handled by a load balancer)||Replication Factor = 3|
As part of this testing, not only did we identify code optimizations that we can quickly fix for huge performance gains, we also learned more about tuning our infrastructure to handle high concurrency. Having this baseline gives us the motivation to continually improve performance of the Usergrid application, reducing the cost for operating a BaaS platform at huge scale.
This post is just the start of our performance series. Stay tuned, as we’ll be publishing more results in the future for the following Usergrid scenarios:
See you next time!