Entries tagged [yahoo!]

Sunday April 09, 2017

Accordion: Developer View of In-Memory Compaction

by Anastasia Braginsky (HBase Committer), Eshcar Hillel (HBase Committer) and Edward Bortnikov (Contributor) of Yahoo! Research

In-memory compaction (Accordion project) demonstrated sizable improvement in HBase’s write amplification and read/write performance. In this post, we describe the design behind Accordion’s algorithms, and how it fits within the HBase internals.

What’s New

Accordion affects the regionserver package. Its centerpiece component is the CompactingMemStore class, which inherits from AbstractMemStore, and is sibling to DefaultMemStore. In contrast with DefaultMemStore, which maintains a monolithic dynamic (mutable) index to cell storage, CompactingMemStore manages multiple indexes, ordered by creation time. The youngest index is mutable, whereas the rest are immutable.

Cell indexes are implemented as descendants of the CellSet class that provides the basic NavigableMap access to cells. In addition to the traditional ConcurrentSkipListMap mutable index, Accordion introduces an immutable CellArrayMap index - a space-efficient ordered array that uses binary search. CellArrayMap is allocated on heap.

Accordion introduces the Segment abstraction, which encapsulates the combination of the CellSet and associated metadata (time range tracker, MSLAB reference, size counters, etc.). Beforehand, these (gory) details were managed directly by the MemStore. The abstract Segment class manages a single CellSet and its metadata. It has two subclasses:  MutableSegment and ImmutableSegment. The latter can either manage an immutable CellSet, or provide a read-only wrapper to a mutable CellSet. The CompositeImmutableSegment class extends ImmutableSegment; it provides a similar API for a fixed set of segments.

Segment’s are scannable. The traversal is provided by the SegmentScanner class that implements the KeyValueScanner interface. SegmentScanner exploits the NavigableMap API implemented by the CellSet encapsulated by the segment.

CompactingMemStore manages one MutableSegment (in what follows, active) and multiple ImmutableSegment’s. It supports the top-level scan mechanism via a list of SegmentScanner’s, each referring to one segment. In this context, the MemStoreScanner class became deprecated and was eliminated in HBase 2.0.

Figure 1 depicts the Segment and cell index (NavigableMap) class hierarchies.


Figure 1. Segment and cell index (NavigableMap) class hierarchies.

Immutable segments are created upon in-memory flush. Following this, they travel through an interim pipeline (CompactionPipeline class) to the snapshot buffer from where they are flushed to disk, and finally released. Pipeline is accessed in parallel by multiple tasks; in what follows, we discuss how its thread-safety and correctness are guaranteed. The snapshot is simpler because its content never changes; it is implemented as CompositeImmutableSegment.  

In-memory flushes trigger in-memory compactions. The latter replace one or more segments in pipeline with semantically equivalent but more memory-efficient presentations. The MemStoreCompactor class is an algorithmic tool that implements the in-memory compaction policies. It uses the MemStoreSegmentsIterator helper class to traverse the segments. Figure 2 depicts the classes that implement in-memory compaction.


Figure 2. Classes that implement in-memory compaction.

The  StoreScanner class implements a consistent scan mechanism for HRegion. It maintains a heap of KeyValueScanner’s to merge the MemStore data with the on-disk HFile data. CompactingMemStore returns a subset of these scanners (list of SegmentScanner instances) for all its Segment’s.

MemStoreCompactor exploits the same mechanism, via the MemStoreSegmentsIterator helper; it only iterates through immutable segments. Figure 3 depicts the classes involved in in-memory compaction.


Figure 3. Classes involved in in-memory compaction.

Managing the Compacting Memstore State

MemStore’s in HBase run processing tasks concurrently with serving normal read and write requests - for example, flush data from RAM to disk. In CompactingMemStore, there are more concurrent scenarios, with in-memory flushes and compactions introducing more complexity. Here, pipeline is the most complex since it is accessed by multiple tasks in parallel.

Our guiding principles are:

  1. Correctness. Data retrieval semantics are preserved - in particular, data is never lost.

  2. Performance. Infrequent flushes and compactions, which happen in the background, do not affect the datapath operations, namely scans.

Let us give a quick look at how these principles manifest in the CompactionPipeline design.

Data Structures and Synchronization

Pipeline contains a double-ended queue of ImmutableSegments’s ordered by segment creation time. It is accessed by scans (read) as well as flushes and compactions (update). Since the segments are immutable, it is sufficient to provide the reader with a clone of the queue. One way to go would be to clone upon each scan, under the protection of a reentrant shared lock. We chose a more efficient copy-on-write approach. Namely, only the update operations synchronize on the pipeline. Each update modifies the read-only copy of the queue (volatile reference). The subsequent scans retrieve their clone lock-free. Note that if some segments are removed from the queue by in-memory compaction or disk flush in parallel with an ongoing scan, correctness is not affected because the data does not disappear. Rather, it may be referenced from multiple locations (for instance, both pipeline and snapshot). The scan algorithm filters the duplicates.

In-memory compaction swaps one or more segments in the queue with new (compacted) segments. Similarly to scan, it is a long-running operation, which should not disrupt the concurrent datapath operations. In order to achieve this, we implemented compaction in a non-blocking way. CompactionPipeline maintains a version that is promoted each time the queue tail is modified. When the compaction starts, it records this version. Upon completion, it atomically checks whether the version changed in the meantime, and atomically swaps the segments if it did not. This opportunistic approach succeed in most cases. Since in-memory compaction is an optimization, it is fine for it to fail on rare occasions. The version counter (long) is volatile - that is, changes to it are atomic and immediately observable.

Detailed Scenarios

Scan Operation (in particular, Get). The SegmentScanner’s are created (non-atomically) in the order of data movement between the MemStore segments, to preserve correctness. For example, in the course of scanner set creation a segment can move from active to pipeline, in which case it will be referenced by two scanners - however, no data is lost. The merge algorithm eliminates the redundant results that stem from the overlap.

In-Memory Flush (happens when active overflows). A dedicated worker (1) blocks updates for the region (via RegionServicesForStores), (2) creates a new ImmutableSegment that wraps active, (3) atomically inserts it into pipeline, (4) creates a new MutableSegment and flips the active reference to it, (5) unblocks the updates, and (6) calls MemStoreCompactor.

Disk Flush (happens when the region overflows, and decides to free up space in RAM). A dedicated worker (1) forces in-memory flush (to guarantee there is at least one segment in the pipeline), (2) creates a new CompositeImmutableSegment from all segments in the read-only clone of pipeline and flips the snapshot reference, (3) atomically removes references to segments in snapshot from CompactionPipeline, and (4) scans snapshot (merge across multiple segments) and flushes the results to disk.

In-Memory Compaction (triggered by in-memory flush, except in the disk flush case). (1) Retrieves a versioned copy of pipeline, (2) builds a new (compacted) ImmutableSegment, (3) atomically, if the version did not change, swap one or more segments in pipeline with the new segment (swap target depends on the compaction policy, see below).

Note that all the atomic sections are extremely lightweight. They only include manipulation of a few references, and avoid any computation and copy.

In-Memory Compaction Policies

MemStoreCompactor provides two compaction policies: BASIC and EAGER.

The BASIC policy is a low-cost/low-overhead alternative that merges the indexes of all segments in pipeline into a single flat index. It does not eliminate redundancies, in order to avoid cell data copy. Namely, once the number of segments in pipeline exceeds N, the algorithm scans the CellSet’s of N+1 youngest segments in pipeline, and copies the KeyValue references to a new CellArrayMap. The scan retrieves all the KeyValue’s in the original CellSet’s ordered by key and version (non-SQM matcher).

The EAGER policy is a high-cost/high-reward alternative that both flattens the index and eliminates redundancies across segments. It scans all the segments in pipeline, and merges them into one segment encapsulating a new CellArrayMap index. Redundant data versions are eliminated in the course of scan (SQM matcher). If the MemStore uses MSLAB cell storage, then the data is copied to new (compact) MSLAB’s under the new index. This policy trades extra data copy and GC overhead for maximal memory efficiency.

Disk Flush Policy and WAL Truncation

HBase 2.0 introduces a notion of sloppy MemStore’s - that is, MemStore implementations that dynamically expand and contract their RAM footprint over time. CompactingMemStore is currently the only sloppy MemStore implementation. When a region triggers a flush to disk to free up memory, sloppy stores  are the last candidates for flush. The rationale is that they manage their memory more efficiently than DefaultMemStore by over time, and therefore should be prioritized for remaining in RAM.

Disk flushes trigger WAL truncation (archiving), as the WAL entries corresponding to persisted data versions become obsolete. Region maintains the estimate of the lower bound (minimum sequence id) of non-flushed data among all its stores; the log entries below this bound can be safely removed. Prior to Accordion, this maintenance was simple. Since DefaultMemStore dumps the whole in-memory content to disk, the store-level minimum sequence id was reset when flush was scheduled, and re-installed by the first put operation to occur after the flush.

Since sloppy stores can flush in-memory data to disk partially (for example, CompactingMemStore can flush any suffix of CompactionPipeline) the minimum sequence id maintenance becomes more subtle, to avoid data loss. Namely, every segment maintains its own minimum sequence id, and therefore, the CompactingMemStore lower bound is the minimum among all segments. Note that this is just a conservative estimate. For example, an eager in-memory compaction that happens concurrently to a disk flush might eliminate redundant cells and thereby lift the lower bound. However, this estimate is safe because the value can only monotonously grow over time. It can be safely computed anytime; no atomicity is required while retrieving the segment lower bounds.

If the WAL grows too big despite the truncation efforts, the periodic LogRoller process kicks in and forces a full flush to disk. This generic mechanism guarantees that the recovery after crash does not need to replay the entire history, and also trims the WAL. In other words, however efficient, in-memory compaction does not eliminate disk flushes entirely - rather, it pushes them further into the future. Note that for when EAGER compaction is adopted, periodic flushing is even more important because the WAL stores all the data redundancies that are eliminated by the compaction algorithm.


In this blog post, we covered Accordion’s internals - new classes, relationships, and execution flows. We also zoomed in the synchronization scheme that guarantees thread-safety, and shed light on the compaction policy implementations.

We thank Michael Stack, Anoop Sam John and Ramkrishna Vasudevan for their continuous support that made this project happen.

Accordion: HBase Breathes with In-Memory Compaction

by Anastasia Braginsky (HBase Committer), Eshcar Hillel (HBase Committer) and Edward Bortnikov (Contributor) of Yahoo! Research

Modern products powered by HBase exhibit ever-increasing  expectations from its read and write performance. Ideally, HBase applications would like to enjoy the speed of in-memory databases without giving up on the reliable persistent storage guarantees. We introduce a new algorithm in HBase 2.0, named Accordion, which takes a significant step towards this goal.

HBase partitions the data into regions controlled by a cluster of RegionServer’s. The internal (vertical) scalability of RegionServer is crucial for end-user performance as well as for the overall system utilization. Accordion improves the RegionServer scalability via a better use of RAM. It accommodates more data in memory and writes to disk less frequently. This manifests in multiple desirable phenomena. First, HBase’s disk occupancy and write amplification are reduced. Second, more reads and writes get served from RAM, and less are stalled by disk I/O - in other words, HBase’s performance is increased. Traditionally, these different metrics were considered at odds, and tuned at each other’s expense. With Accordion, they all get improved simultaneously.

Accordion is inspired by the Log-Structured-Merge (LSM) tree design pattern that governs the HBase storage organization. An HBase region is stored as a sequence of searchable key-value maps. The topmost is a mutable in-memory store, called MemStore, which absorbs the recent write (put) operations. The rest are immutable HDFS files, called HFiles. Once a MemStore overflows, it is flushed to disk, creating a new HFile. HBase adopts the multi-versioned concurrency control, that is, MemStore stores all data modifications as separate versions. Multiple versions of one key may therefore reside in MemStore and the HFile tier. A read (get) operation, which retrieves the value by key, scans the HFile data in BlockCache, seeking for the latest version. To reduce the number of disk accesses, HFiles are merged in the background. This process, called compaction, removes the redundant cells and creates larger files.

LSM trees deliver superior write performance by transforming random application-level I/O to sequential disk I/O. However, their traditional design makes no attempt to compact the in-memory data. This stems from historical reasons: LSM trees have been designed in the age when RAM was very short resource, therefore the MemStore capacity was small. With recent changes in the hardware landscape, the overall MemStore memstore managed by RegionServer can be multiple gigabytes, leaving a lot of headroom for optimization.

Accordion reapplies the LSM principle to MemStore, in order to eliminate redundancies and other overhead while the data is still in RAM. Doing so decreases the frequency of flushes to HDFS, thereby reducing the write amplification and the overall disk footprint. With less flushes, the write operations are stalled less frequently as the MemStore overflows, therefore the write performance is improved. Less data on disk also implies less pressure on the block cache, higher hit rates, and eventually better read response times. Finally, having less disk writes also means having less compaction happening in the background, i.e., less cycles are stolen from productive (read and write) work. All in all, the effect of in-memory compaction can be envisioned as a catalyst that enables the system move faster as a whole.

Accordion currently provides two levels of in-memory compaction - basic and eager. The former applies generic optimizations that are good for all data update patterns. The latter is most useful for applications with high data churn, like producer-consumer queues, shopping carts, shared counters, etc. All these use cases feature frequent updates of the same keys, which generate multiple redundant versions that the algorithm takes advantage of to provide more value. On the flip side, eager optimization may incur compute overhead (more memory copies and garbage collection), which may affect response times under intensive write loads. The overhead is high if the MemStore uses on-heap MemStore-Local Allocation Buffer (MSLAB) allocation; this configuration is not advised in conjunction with eager compaction. See more details about Accordion’s compaction algorithms in the next sections.

Future implementations may tune the optimal compaction policy automatically, based on the observed workload.

How To Use

The in-memory compaction level can be configured both globally and per column family. The supported levels are none (legacy implementation), basic, and eager.

By default, all tables apply basic in-memory compaction. This global configuration can be overridden in hbase-site.xml, as follows:





The level can also be configured in the HBase shell per column family, as follows:  

create ‘<tablename>’,


Performance Gains, or Why You Should Care

We stress-tested HBase extensively via the popular Yahoo Cloud Service Benchmark (YCSB). Our experiments used 100-200 GB datasets, and exercised a variety of representative workloads. The results demonstrate significant performance gains delivered by Accordion.

Heavy-tailed (Zipf) distribution. The first experiment exercises a workload in which the key popularities follow the Zipf distribution that arises in most of the real-life scenarios. In this context, when 100% of the operations are writes, Accordion achieves up to 30% reduction of write amplification, 20% increase of write throughput, and 22% reduction of GC. When 50% of the operations are reads, the tail read latency is reduced by 12%.

Uniform distribution. The second experiment exercises a workload in which all keys are equally popular. In this context, under 100% writes, Accordion delivers up to 25% reduction of write amplification, 50% increase of write throughput, and 36% reduction of GC. The tail read latencies are not impacted (which is expected, due to complete lack of locality).

How Accordion Works

High Level Design. Accordion introduces CompactingMemStore - a MemStore implementation that applies compaction internally. Contrast to the default MemStore, which maintains all data in one monolithic data structure, Accordion manages it as a sequence of segments. The youngest segment, called active, is mutable; it absorbs the put operations. Upon overflow (by default, 32MB - 25% of the MemStore size bound), the active segment is moved to an in-memory pipeline, and becomes immutable. We call this in-memory flush. Get operations scan through these segments and the HFiles (the latter are accessed via the block cache, as usual in HBase).

CompactingMemStore may merge multiple immutable segments in the background from time to time, creating larger and leaner segments. The pipeline is therefore “breathing” (expanding and contracting), similar to accordion bellows.

When RegionServer decides to flush one or more MemStore’s to disk to free up memory, it considers the CompactingMemStore’s after the rest that have overflown. The rationale is to prolong the lifetime of MemStore’s that manage their memory efficiently, in order to reduce the overall I/O. When such a flush does happen, all pipeline segments are moved to a composite snapshot,  merged, and streamed to a new HFile.

Figure 1 illustrates the structure of CompactingMemStore versus the traditional design.

Figure 1. CompactingMemStore vs DefaultMemStore

Segment Structure. Similarly to the default MemStore, CompactingMemStore maintains an index on top of cell storage, to allow fast search by key. Traditionally, this index was implemented as a Java skiplist  (ConcurrentSkipListMap) - a dynamic but wasteful data structure that manages a lot of small objects. CompactingMemStore uses a space-efficient flat layout for immutable segment indexes. This universal optimization helps all compaction policies reduce the RAM overhead, even when the data has little-to-none redundancies. Once a segment is added to the pipeline, the store serializes its index into a sorted array named CellArrayMap that is amenable to fast binary search.

CellArrayMap supports both direct allocation of cells from the Java heap and custom allocation from MSLAB’s - either on-heap or off-heap. The implementation differences are abstracted away via the helper KeyValue objects that are referenced from the index (Figure 2). CellArrayMap itself is always allocated on-heap.

Figure 2. Immutable segment with a flat CellArrayMap index and MSLAB cell storage.

Compaction Algorithms. The in-memory compaction algorithms maintains a single flat index on top of the pipelined segments. This saves space, especially when the data items are small, and therefore pushes the disk flush further off away in time. A single index allows searching in one place, therefore bounding the tail read latency.

When an active segment is flushed to memory, it is queued to the compaction pipeline, and a background merge task is immediately scheduled. The latter simultaneously scans all the segments in the pipeline (similarly to on-disk compaction) and merges their indexes into one. The differences between the basic and eager compaction policies manifest in how they handle the cell data. Basic compaction does not eliminate the redundant data versions in order to  avoid physical copy; it just rearranges the references the KeyValue objects. Eager compaction, on the contrary, filters out the duplicates. This comes at the cost of extra compute and data migration - for example, with MSLAB storage the surviving cells are copied to the newly created MSLAB(s). The compaction overhead pays off when the data is highly redundant.

Future implementations of compaction may automate the choice between the basic and eager compaction policies. For example, the algorithm might try eager compaction once in awhile, and schedule the next compaction based on the value delivered (i.e., fraction of data eliminated). Such an approach could relieve the system administrator from deciding a-priori, and adapt to changing access patterns.


In this blog post, we covered Accordion’s basic principles, configuration, performance gains, and some details of the in-memory compaction algorithms. The next post will focus on system internals for HBase developers.

We thank Michael Stack, Anoop Sam John and Ramkrishna Vasudevan for their continuous support that made this project happen.



Hot Blogs (today's hits)

Tag Cloud