Apache HBase

Sunday April 09, 2017

Accordion: Developer View of In-Memory Compaction

by Anastasia Braginsky (HBase Committer), Eshcar Hillel (HBase Committer) and Edward Bortnikov (Contributor) of Yahoo! Research

In-memory compaction (Accordion project) demonstrated sizable improvement in HBase’s write amplification and read/write performance. In this post, we describe the design behind Accordion’s algorithms, and how it fits within the HBase internals.

What’s New

Accordion affects the regionserver package. Its centerpiece is the CompactingMemStore class, which inherits from AbstractMemStore and is a sibling of DefaultMemStore. In contrast with DefaultMemStore, which maintains a monolithic dynamic (mutable) index over the cell storage, CompactingMemStore manages multiple indexes, ordered by creation time. The youngest index is mutable, whereas the rest are immutable.

Cell indexes are implemented as descendants of the CellSet class, which provides basic NavigableMap access to cells. In addition to the traditional mutable index based on ConcurrentSkipListMap, Accordion introduces an immutable CellArrayMap index - a space-efficient ordered array searched via binary search. CellArrayMap is allocated on the heap.
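
To make the index structure concrete, here is a minimal sketch of a CellArrayMap-style flat index: an immutable sorted array searched with binary search. The class and its contents are illustrative stand-ins (plain strings instead of Cell references), not the actual HBase implementation.

```java
import java.util.Arrays;

// Hypothetical sketch of a CellArrayMap-like index: an immutable sorted
// array searched with binary search. Names are illustrative, not HBase's.
public class FlatIndex {
    private final String[] cells;  // stand-in for sorted Cell references

    public FlatIndex(String[] sortedCells) {
        // The array is assumed to be pre-sorted; it is never mutated.
        this.cells = sortedCells;
    }

    // Floor-style lookup, as a NavigableMap-like API would expose it.
    public String floor(String key) {
        int idx = Arrays.binarySearch(cells, key);
        if (idx >= 0) return cells[idx];      // exact hit
        int insertion = -idx - 1;             // index of first element > key
        return insertion == 0 ? null : cells[insertion - 1];
    }
}
```

Compared with a skip list, such an array carries no per-node link overhead, which is where the space efficiency comes from.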

Accordion introduces the Segment abstraction, which encapsulates the combination of a CellSet and its associated metadata (time range tracker, MSLAB reference, size counters, etc.). Previously, these (gory) details were managed directly by the MemStore. The abstract Segment class manages a single CellSet and its metadata. It has two subclasses: MutableSegment and ImmutableSegment. The latter can either manage an immutable CellSet, or provide a read-only wrapper to a mutable CellSet. The CompositeImmutableSegment class extends ImmutableSegment; it provides a similar API for a fixed set of segments.
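
A minimal sketch of the hierarchy described above; the classes and method here are illustrative placeholders (hence the Sketch suffix), not the real HBase types:

```java
// Illustrative shape of the Segment hierarchy; not the actual HBase classes.
abstract class SegmentSketch {
    abstract boolean isMutable();
}

class MutableSegmentSketch extends SegmentSketch {
    boolean isMutable() { return true; }   // the single active segment
}

class ImmutableSegmentSketch extends SegmentSketch {
    boolean isMutable() { return false; }  // pipeline/snapshot segments
}

class CompositeImmutableSegmentSketch extends ImmutableSegmentSketch {
    // Wraps a fixed set of immutable segments behind the same API.
}
```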

Segments are scannable. Traversal is provided by the SegmentScanner class, which implements the KeyValueScanner interface. SegmentScanner exploits the NavigableMap API implemented by the CellSet encapsulated by the segment.

CompactingMemStore manages one MutableSegment (referred to as active in what follows) and multiple ImmutableSegments. It supports the top-level scan mechanism via a list of SegmentScanners, each referring to one segment. In this context, the MemStoreScanner class became deprecated and was eliminated in HBase 2.0.

Figure 1 depicts the Segment and cell index (NavigableMap) class hierarchies.


Figure 1. Segment and cell index (NavigableMap) class hierarchies.

Immutable segments are created upon in-memory flush. Following this, they travel through an interim pipeline (the CompactionPipeline class) to the snapshot buffer, from where they are flushed to disk and finally released. The pipeline is accessed in parallel by multiple tasks; in what follows, we discuss how its thread-safety and correctness are guaranteed. The snapshot is simpler because its content never changes; it is implemented as a CompositeImmutableSegment.

In-memory flushes trigger in-memory compactions. The latter replace one or more segments in the pipeline with semantically equivalent but more memory-efficient representations. The MemStoreCompactor class is an algorithmic tool that implements the in-memory compaction policies. It uses the MemStoreSegmentsIterator helper class to traverse the segments. Figure 2 depicts the classes that implement in-memory compaction.


Figure 2. Classes that implement in-memory compaction.

The StoreScanner class implements a consistent scan mechanism for HRegion. It maintains a heap of KeyValueScanners to merge the MemStore data with the on-disk HFile data. CompactingMemStore returns a subset of these scanners (a list of SegmentScanner instances), one for each of its Segments.

MemStoreCompactor exploits the same mechanism, via the MemStoreSegmentsIterator helper; it only iterates through immutable segments. Figure 3 depicts the classes involved in in-memory compaction.


Figure 3. Classes involved in in-memory compaction.

Managing the Compacting Memstore State

MemStores in HBase run processing tasks concurrently with serving normal read and write requests - for example, flushing data from RAM to disk. In CompactingMemStore, there are more concurrent scenarios; in-memory flushes and compactions introduce additional complexity. Here, the pipeline is the most complex component, since it is accessed by multiple tasks in parallel.

Our guiding principles are:

  1. Correctness. Data retrieval semantics are preserved - in particular, data is never lost.

  2. Performance. Infrequent flushes and compactions, which happen in the background, do not affect the datapath operations, namely scans.

Let us take a quick look at how these principles manifest in the CompactionPipeline design.

Data Structures and Synchronization

The pipeline contains a double-ended queue of ImmutableSegments, ordered by segment creation time. It is accessed by scans (read) as well as flushes and compactions (update). Since the segments are immutable, it is sufficient to provide the reader with a clone of the queue. One option would be to clone upon each scan, under the protection of a reentrant shared lock. We chose a more efficient copy-on-write approach. Namely, only the update operations synchronize on the pipeline. Each update creates a new read-only copy of the queue and publishes it through a volatile reference. The subsequent scans retrieve their clone lock-free. Note that if some segments are removed from the queue by in-memory compaction or disk flush in parallel with an ongoing scan, correctness is not affected, because the data does not disappear. Rather, it may be referenced from multiple locations (for instance, both the pipeline and the snapshot). The scan algorithm filters out the duplicates.
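
The copy-on-write scheme can be sketched as follows. This is an illustrative model, not the HBase source: a generic element type stands in for ImmutableSegment, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of CompactionPipeline's copy-on-write scheme: rare updates
// synchronize and publish a fresh read-only list through a volatile
// reference; frequent scans read it lock-free.
public class CowPipeline<T> {
    // Readers observe a consistent snapshot through this volatile reference.
    private volatile List<T> readOnlyCopy = new ArrayList<>();

    // Updates (flush/compaction) are infrequent and serialize on the monitor.
    public synchronized void push(T segment) {
        List<T> next = new ArrayList<>(readOnlyCopy);
        next.add(0, segment);   // youngest segment at the head
        readOnlyCopy = next;    // atomic publish of the new copy
    }

    // Scans grab the current snapshot without taking any lock.
    public List<T> snapshot() {
        return readOnlyCopy;
    }
}
```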

In-memory compaction swaps one or more segments in the queue with new (compacted) segments. Like scan, it is a long-running operation that should not disrupt the concurrent datapath operations. To achieve this, we implemented compaction in a non-blocking way. CompactionPipeline maintains a version that is promoted each time the queue tail is modified. When compaction starts, it records this version. Upon completion, it atomically checks whether the version has changed in the meantime, and swaps the segments only if it has not. This opportunistic approach succeeds in most cases. Since in-memory compaction is an optimization, it is fine for it to fail on rare occasions. The version counter (a long) is volatile - that is, changes to it are atomic and immediately observable.
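
The versioned optimistic swap can be sketched like this; all names are illustrative, and a plain string stands in for the compacted queue tail:

```java
// Sketch of the optimistic version check used by in-memory compaction:
// the compactor records the pipeline version up front and swaps only if
// nothing changed the queue tail in the meantime. Not HBase source.
public class VersionedSwap {
    private long version = 0;
    private String tail = "initial";

    public synchronized long readVersion() { return version; }

    // Called, e.g., when a disk flush removes segments from the queue tail.
    public synchronized void tailModified() { version++; }

    // Called by the compactor upon completion.
    public synchronized boolean trySwap(long observedVersion, String compacted) {
        if (version != observedVersion) {
            return false;       // lost the race; compaction simply gives up
        }
        tail = compacted;       // install the compacted segment(s)
        version++;              // the swap itself modifies the tail
        return true;
    }

    public synchronized String tail() { return tail; }
}
```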

Detailed Scenarios

Scan Operation (in particular, Get). The SegmentScanners are created (non-atomically) in the order of data movement between the MemStore segments, to preserve correctness. For example, in the course of scanner set creation a segment can move from active to the pipeline, in which case it will be referenced by two scanners - however, no data is lost. The merge algorithm eliminates the redundant results that stem from the overlap.

In-Memory Flush (happens when active overflows). A dedicated worker (1) blocks updates for the region (via RegionServicesForStores), (2) creates a new ImmutableSegment that wraps active, (3) atomically inserts it into pipeline, (4) creates a new MutableSegment and flips the active reference to it, (5) unblocks the updates, and (6) calls MemStoreCompactor.
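
As a rough model of the flush steps above (a StringBuilder stands in for the mutable segment, plain strings for immutable ones, and synchronized methods for the update blocking; none of these names are HBase API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hedged sketch of the in-memory flush steps; illustrative, not HBase code.
public class FlushSketch {
    private StringBuilder active = new StringBuilder();        // MutableSegment stand-in
    private final Deque<String> pipeline = new ArrayDeque<>(); // youngest segment first

    public synchronized void write(String cell) {              // normal write path
        active.append(cell);
    }

    public synchronized void inMemoryFlush() {                 // (1) monitor blocks updates
        pipeline.addFirst(active.toString());                  // (2)+(3) wrap active, push to pipeline
        active = new StringBuilder();                          // (4) flip the active reference
    }                                                          // (5) updates resume on return
    // (6) a real worker would now invoke the compactor

    public synchronized String youngestImmutable() {
        return pipeline.peekFirst();
    }
}
```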

Disk Flush (happens when the region overflows, and decides to free up space in RAM). A dedicated worker (1) forces in-memory flush (to guarantee there is at least one segment in the pipeline), (2) creates a new CompositeImmutableSegment from all segments in the read-only clone of pipeline and flips the snapshot reference, (3) atomically removes references to segments in snapshot from CompactionPipeline, and (4) scans snapshot (merge across multiple segments) and flushes the results to disk.

In-Memory Compaction (triggered by in-memory flush, except in the disk flush case). (1) Retrieves a versioned copy of the pipeline, (2) builds a new (compacted) ImmutableSegment, (3) atomically, if the version did not change, swaps one or more segments in the pipeline with the new segment (the swap target depends on the compaction policy, see below).

Note that all the atomic sections are extremely lightweight. They only include manipulation of a few references, and avoid any computation and copy.

In-Memory Compaction Policies

MemStoreCompactor provides two compaction policies: BASIC and EAGER.

The BASIC policy is a low-cost/low-overhead alternative that merges the indexes of all segments in the pipeline into a single flat index. It does not eliminate redundancies, in order to avoid copying cell data. Namely, once the number of segments in the pipeline exceeds N, the algorithm scans the CellSets of the N+1 youngest segments in the pipeline, and copies the KeyValue references to a new CellArrayMap. The scan retrieves all the KeyValues in the original CellSets, ordered by key and version (non-SQM matcher).
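
The flattening step of BASIC can be sketched as a heap-based merge that copies references only and keeps duplicates. Strings stand in for KeyValue references, and the helper is hypothetical, not an HBase method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of BASIC flattening: merge sorted segment indexes into one flat
// list by copying references, without redundancy elimination.
public class BasicFlatten {
    public static List<String> flatten(List<List<String>> segments) {
        // Min-heap of cursors {segmentIdx, offset}, ordered by current value.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> segments.get(a[0]).get(a[1])
                       .compareTo(segments.get(b[0]).get(b[1])));
        for (int i = 0; i < segments.size(); i++) {
            if (!segments.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> flat = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cur = heap.poll();
            flat.add(segments.get(cur[0]).get(cur[1]));  // copy the reference
            if (cur[1] + 1 < segments.get(cur[0]).size()) {
                heap.add(new int[]{cur[0], cur[1] + 1}); // advance the cursor
            }
        }
        return flat;
    }
}
```

Note that duplicate keys from different segments survive the merge; only the index is consolidated.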

The EAGER policy is a high-cost/high-reward alternative that both flattens the index and eliminates redundancies across segments. It scans all the segments in the pipeline and merges them into one segment encapsulating a new CellArrayMap index. Redundant data versions are eliminated in the course of the scan (SQM matcher). If the MemStore uses MSLAB cell storage, then the data is copied to new (compact) MSLABs under the new index. This policy trades extra data copying and GC overhead for maximal memory efficiency.
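
The extra step EAGER adds on top of flattening can be sketched as a dedup pass over the merged, key-ordered stream, keeping only the newest version per key. The `"key:version"` string encoding is purely illustrative, not how HBase represents cells:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of EAGER's redundancy elimination: the input is sorted by key,
// newest version first among equal keys (as the merge scan emits cells);
// only the first cell per key survives. Illustrative, not HBase code.
public class EagerDedup {
    public static List<String> dedup(List<String> mergedByKey) {
        List<String> out = new ArrayList<>();
        String lastKey = null;
        for (String cell : mergedByKey) {
            String key = cell.split(":")[0];   // "rowkey:version" stand-in
            if (!key.equals(lastKey)) {
                out.add(cell);                 // newest version of this key
                lastKey = key;
            }  // else: older version of the same key, redundant, dropped
        }
        return out;
    }
}
```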

Disk Flush Policy and WAL Truncation

HBase 2.0 introduces the notion of sloppy MemStores - that is, MemStore implementations that dynamically expand and contract their RAM footprint over time. CompactingMemStore is currently the only sloppy MemStore implementation. When a region triggers a flush to disk to free up memory, sloppy stores are the last candidates for flushing. The rationale is that they manage their memory more efficiently than DefaultMemStore over time, and therefore should be prioritized for remaining in RAM.

Disk flushes trigger WAL truncation (archiving), as the WAL entries corresponding to persisted data versions become obsolete. The region maintains an estimate of the lower bound (minimum sequence id) of non-flushed data among all its stores; the log entries below this bound can be safely removed. Prior to Accordion, this maintenance was simple: since DefaultMemStore dumps its whole in-memory content to disk, the store-level minimum sequence id was reset when a flush was scheduled, and re-installed by the first put operation to occur after the flush.

Since sloppy stores can flush their in-memory data to disk partially (for example, CompactingMemStore can flush any suffix of CompactionPipeline), the minimum sequence id maintenance becomes more subtle, to avoid data loss. Namely, every segment maintains its own minimum sequence id, and the CompactingMemStore lower bound is therefore the minimum among all segments. Note that this is just a conservative estimate. For example, an eager in-memory compaction that happens concurrently with a disk flush might eliminate redundant cells and thereby lift the lower bound. However, the estimate is safe because the value can only grow monotonically over time. It can be safely computed at any time; no atomicity is required while retrieving the segment lower bounds.
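
The lower-bound computation itself takes only a few lines; as the text notes, no atomicity is needed while collecting the per-segment minima. The class and method names here are illustrative, not HBase API:

```java
import java.util.List;

// Sketch of the conservative WAL truncation bound: the store's minimum
// sequence id is the minimum over all segments' own minima.
public class MinSeqId {
    public static long storeLowerBound(List<Long> segmentMinSeqIds) {
        long min = Long.MAX_VALUE;  // sentinel: no unflushed data
        for (long seqId : segmentMinSeqIds) {
            min = Math.min(min, seqId);
        }
        return min;
    }
}
```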

If the WAL grows too big despite these truncation efforts, the periodic LogRoller process kicks in and forces a full flush to disk. This generic mechanism guarantees that recovery after a crash does not need to replay the entire history, and it also trims the WAL. In other words, however efficient, in-memory compaction does not eliminate disk flushes entirely - rather, it pushes them further into the future. Note that when EAGER compaction is adopted, periodic flushing is even more important, because the WAL stores all the data redundancies that are eliminated by the compaction algorithm.


In this blog post, we covered Accordion’s internals - new classes, relationships, and execution flows. We also zoomed in on the synchronization scheme that guarantees thread-safety, and shed light on the compaction policy implementations.

We thank Michael Stack, Anoop Sam John and Ramkrishna Vasudevan for their continuous support that made this project happen.

