Entries tagged [performance]

Saturday April 23, 2016

HDFS HSM and HBase: Conclusions (Part 7 of 7)

This is part 7 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)  

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Conclusions

There are many things to consider when choosing the hardware of a cluster. According to the test results, in the SSD-related cases the network throughput between DataNodes is larger than 10Gbps. If you are using a 10Gbps switch, the network will be the bottleneck and impact the performance. We suggest either extending the network bandwidth by network bonding, or upgrading to a more powerful switch with a higher bandwidth. In the 1T_HDD and 1T_RAM_HDD cases, the network throughput is lower than 10 Gbps most of the time, so using a 10 Gbps switch to connect DataNodes is fine.

In all 1TB dataset tests, 1T_RAM_SSD shows the best performance. An appropriate mix of different storage types can improve the HBase write performance. First, write the latency-sensitive and blocking data to faster storage, and write the data that is rarely compacted and accessed to slower storage. Second, avoid mixing storage types with a large performance gap, such as in 1T_RAM_HDD.

The hardware design issue limits the total disk bandwidth, so eight SSDs show hardly any superiority over four SSDs. Either enhance the hardware by using HBA cards to eliminate this limitation for eight SSDs, or mix the storage types appropriately. According to the test results, in order to achieve a better balance between performance and cost, using four SSDs and four HDDs can achieve good performance (102% throughput and 101% latency of eight SSDs) at a much lower price. The RAMDISK/SSD tiered storage is the winner in both throughput and latency among all the tests, so if cost is not an issue and maximum performance is needed, RAMDISK (an extremely high speed block device, e.g. an NVMe PCI-E SSD) with SSD should be chosen.

You should not use a large number of flushers/compactors when most of the data is written to HDD. Reads and writes share the single channel of each HDD, so too many flushers and compactors running at the same time can slow down the HDD performance.

During the tests, we found some things that can be improved in both HBase and HDFS.

In HBase, the memstore is consumed quickly when the WALs are stored in fast storage; this can lead to regular long GC pauses. It is better to have an off-heap memstore for HBase.

In HDFS, each DataNode shares a single lock when creating/finalizing blocks. A slow create/finalize operation in one DataXceiver can block the create/finalize operations of all other DataXceivers on the same DataNode, no matter which storage they are using. We need to eliminate this blocking across storage types; a finer grained lock mechanism that isolates the operations on different blocks is needed (HDFS-9668). It would also be good to implement a latency-aware VolumeChoosingPolicy in HDFS that removes slow volumes from the candidates.

RoundRobinVolumeChoosingPolicy can lead to load imbalance in HDFS with tiered storage (HDFS-9608).

In HDFS, renaming a file to a different storage does not actually move the blocks. We need to move the HDFS blocks asynchronously in such a case.

Acknowledgements

The authors would like to thank Weihua Jiang, the former manager of the big data team at Intel, for leading this performance evaluation, and thank Anoop Sam John (Intel), Apekshit Sharma (Cloudera), Jonathan Hsieh (Cloudera), Michael Stack (Cloudera), Ramkrishna S. Vasudevan (Intel), Sean Busbey (Cloudera) and Uma Gangumalla (Intel) for the nice review and guidance.

HDFS HSM and HBase: Issues (Part 6 of 7)

This is part 6 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Issues and Improvements

This section describes the issues we found during the tests.

Software

Long-time BLOCKED threads in DataNode

In the 1T_RAM_HDD test, we can observe substantial periods of zero throughput in the YCSB client. After diving into the thread stacks, we find that many DataXceiver threads in the DataNode are stuck in a BLOCKED state for a long time. We can observe this in other cases too, but it happens most often in 1T_RAM_HDD.
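
The blocked threads were identified from DataNode thread dumps. As a reference, a minimal sketch of how such dumps can be sampled and scanned with standard JDK tools follows; the sample count, interval, and file names are arbitrary assumptions, not the exact procedure used in the tests.

# Illustrative sketch: periodically dump DataNode threads and count BLOCKED ones.
DN_PID=$(jps | awk '/DataNode/ {print $1}')
for i in $(seq 1 12); do
  jstack "$DN_PID" > dn-stack-$i.txt
  echo "sample $i: $(grep -c 'java.lang.Thread.State: BLOCKED' dn-stack-$i.txt) BLOCKED threads"
  sleep 5
done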

In each DataNode there is a single instance of FsDatasetImpl with many synchronized methods; DataXceiver threads synchronize on this instance when creating/finalizing blocks. A slow create/finalize operation in one DataXceiver thread can therefore block the create/finalize operations in all other DataXceiver threads. The following table shows the time consumed by these operations in 1T_RAM_HDD:

Synchronized methods | Max exec time (ms) in light load | Max wait time (ms) in light load | Max exec time (ms) in heavy load | Max wait time (ms) in heavy load
finalizeBlock        | 0.0493 | 0.1397 | 17652.9761 | 21249.0423
addBlock             | 0.0089 | 0.4238 | 15685.7575 | 57420.0587
createRbw            | 0.0480 | 1.7422 | 21145.8033 | 56918.6570

Table 13. DataXceiver threads and time consumed

We can see that both the execution time and the wait time on the synchronized methods increase dramatically as the system load increases; the time spent waiting for the lock can be up to tens of seconds. Slow operations usually come from slow storage such as HDD. They can hurt the concurrent create/finalize operations on fast storage, so HDFS/HBase cannot make full use of tiered storage.

A finer grained lock mechanism in DataNode is needed to fix this issue. We are working on this improvement now (HDFS-9668).

Load Imbalance in HDFS with Tiered Storage

In the tiered storage cases, we find that the utilization is not the same among volumes of the same storage type when using RoundRobinVolumeChoosingPolicy. The root cause is that RoundRobinVolumeChoosingPolicy uses a shared counter to choose volumes for all storage types. This can be unfair when choosing volumes for a certain storage type, so the volumes at the tail of the configured data directories have a lower chance of being written to.

The situation becomes even worse when there are different numbers of volumes of different storage types. We have filed this issue in JIRA (HDFS-9608) and provided a patch.

Asynchronous File Movement Across Storage When Renaming in HDFS

Currently data blocks in HDFS are stored on a type of storage media according to the storage policy specified at creation time. After that, the data blocks remain where they are until the external HDFS tool Mover is used. Mover scans the whole namespace and moves the data blocks that are not stored on the storage media the policy specifies.

In tiered storage, when we rename a file/directory from one storage to a different one, we have to move the blocks of that file, or of all files under that directory, to the right storage. This is not currently provided in HDFS.
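
With the stock tools, such a movement has to be triggered manually today. The following is a hedged sketch of what that looks like; the path and policy name are illustrative, and on older Hadoop releases the policy command is exposed as hdfs dfsadmin -setStoragePolicy instead.

# Illustrative sketch: change the storage policy of a path, then run the HDFS Mover
# to physically migrate the existing blocks (a rename by itself does not move them).
hdfs storagepolicies -setStoragePolicy -path /hbase/data/default/usertable -policy ALL_SSD
hdfs mover -p /hbase/data/default/usertable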

Non-configurable Storage Type and Policy

Currently in HDFS both storage type and storage policy are predefined in source code. This makes it inconvenient to add implementations for new devices and policies. It is better to make them configurable.

No Optimization for Certain Storage Type

Currently there is no difference in the execution path for different storage types. As more and more high performance storage devices are adopted, the performance gap between storage types will become larger, and the optimization for certain types of storage will be needed.

Take writing a certain amount of data into HDFS as an example. If users want to minimize the total write time, the optimal way for HDD may be to use compression to save disk I/O, while for RAMDISK writing directly is more suitable as it eliminates the overhead of compression. This scenario requires per-storage-type configurations, which are not supported in the current implementation.

Hardware

Disk Bandwidth Limitation

In the section 50GB Dataset in a Single Storage, the performance difference between four SSDs and eight SSDs is very small. The root cause is that the total bandwidth available for the eight SSDs is limited by the upper-level hardware controllers. Figure 16 illustrates the motherboard design. The eight disk slots connect to two different SATA controllers (Ports 0:3 - SATA and Port 0:3 - sSATA). As highlighted in the red rectangle, the maximum bandwidth available from the two controllers in the server is 2 x 6 Gbps = 12 Gbps ≈ 1536 MB/s.

Figure 16. Hardware design of server motherboard

The maximum throughput of a single disk, measured with the FIO tool, is listed in Table 14.


Storage | Read BW (MB/s) | Write BW (MB/s)
HDD     | 140    | 127
SSD     | 481    | 447
RAMDISK | >12059 | >11194

Table 14. Maximum throughput of storage media

Note: RAMDISK is essentially memory and does not go through the same controllers as the SSDs and HDDs, so it is not subject to the 2 x 6 Gbps limitation. RAMDISK data is listed in the table for ease of comparison.
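
For reference, the following is a minimal FIO sketch of the kind of sequential-bandwidth measurement described above. The device name, block size, queue depth and runtime are assumptions, not the exact parameters used in the report, and writing to a raw device is destructive.

# Illustrative sketch: sequential write bandwidth of a single disk.
# WARNING: this overwrites /dev/sdb; use a scratch device only.
fio --name=seq-write --filename=/dev/sdb --rw=write --bs=1M \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based \
    --group_reporting
# Repeat with --rw=read for the read bandwidth.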

According to Table 14, the write bandwidth of eight SSDs is 447 x 8 = 3576 MB/s. This exceeds the controllers' 1536 MB/s physical limit, so only 1536 MB/s is available to all eight SSDs. HDD is not affected by this limitation as the total HDD bandwidth (127 x 8 = 1016 MB/s) is below the limit. This fact greatly impacts the performance of the storage system.

We suggest one of the following:

  • Enhance hardware by using more HBA cards to eliminate the limitation of the design issue.

  • Use SSD and HDD together with an appropriate ratio (for example four SSDs and four HDDs) to achieve a better balance between performance and cost.

Disk I/O Bandwidth and Latency Varies for Ports

As described in the section Disk Bandwidth Limitation, four of the eight disks connect to Ports 0:3 - SATA and the other four connect to Port 0:3 - sSATA; the total bandwidth of the two controllers is 12 Gbps. We find that this bandwidth is not evenly divided among the disk channels.

We ran a test in which each SSD (sda, sdb, sdc and sdd connect to Port 0:3 - sSATA; sde, sdf, sdg and sdh connect to Ports 0:3 - SATA) is written by an individual FIO process. We expected all eight disks to be written at the same speed, but according to the IOSTAT output the 1536 MB/s of bandwidth is not evenly divided between the two controllers and the eight disk channels. As shown in Figure 17, the four SSDs connected to Ports 0:3 - SATA obtain more I/O bandwidth (213.5MB/s*4) than the others (107MB/s*4).
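
A rough sketch of this per-disk experiment is shown below; the mount points, file size and FIO parameters are assumptions, chosen only to illustrate one writer per disk while IOSTAT reports the per-device bandwidth.

# Illustrative sketch: one FIO writer per SSD, observed with IOSTAT.
for d in sda sdb sdc sdd sde sdf sdg sdh; do
  fio --name=w-$d --filename=/mnt/$d/fio.test --size=20G --rw=write --bs=1M \
      --ioengine=libaio --iodepth=32 --direct=1 &
done
iostat -xm 5    # compare the wMB/s column across sda..sdh while the jobs run
wait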

We suggest that you consider the controller limitation and storage bandwidth when setting up a cluster. Using four SSDs and four HDDs in a node is a reasonable choice, and it is better to install the four SSDs to Port 0:3 - SATA.

Additionally, with the existing VolumeChoosingPolicy the disks with higher latency might take the same workload as the disks with lower latency, which would slow down the performance. We suggest implementing a latency-aware VolumeChoosingPolicy in HDFS.

Figure 17. Different write speed and await time of disks

Go to part 7, Conclusions

HDFS HSM and HBase: Experiment (continued) (Part 5 of 7)

This is part 5 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

1TB Dataset in a Single Storage

The performance for the 1TB dataset on HDD and SSD is shown in Figure 6 and Figure 7. Due to the limited memory capacity, the 1TB dataset is not tested on RAMDISK.

Figure 6. YCSB throughput of a single storage type with 1TB dataset

Figure 7. YCSB latency of a single storage type with 1TB dataset

The throughput and latency on SSD are both better than on HDD (134% throughput and 35% latency). This is consistent with the 50GB data test.

The throughput benefit gained by using SSD differs between 50GB and 1TB (from 128% to 134%); SSD gains more benefit in the 1TB test. This is because many more I/O-intensive events such as compactions occur in the 1TB dataset test than in the 50GB test, which shows the superiority of SSD in huge-data scenarios. Figure 8 shows the changes in network throughput during the tests.

Figure 8. Network throughput measured for case 1T_HDD and 1T_SSD

In the 1T_HDD case the network throughput is lower than 10Gbps, while in the 1T_SSD case the network throughput can be much larger than 10Gbps. This means that if we used a 10Gbps switch in the 1T_SSD case, the network would be the bottleneck.

Figure 9. Disk throughput measured for case 1T_HDD and 1T_SSD

In Figure 9, we can see the bottleneck for these two cases is disk bandwidth.

  • In 1T_HDD, at the beginning of the test the throughput is almost 1000 MB/s, but after a while the throughput drops because regions hit their memstore limits due to slow flushes.

  • In the 1T_SSD case, the throughput seems to be limited by a ceiling of around 1300 MB/s, nearly the same as the bandwidth limitation of the SATA controllers. To further improve the throughput, more SATA controllers (e.g. an HBA card) are needed rather than more SSDs.

During the 1T_SSD test, we observe that the operation latencies of the eight SSDs per node are very different, as shown in the following chart. In Figure 10 we only include the latency of two disks: sdb represents the disks with a high latency and sdf represents the disks with a low latency.

Figure 10. I/O await time measured for different disks

Four of them have better latency than the other four. This is caused by the hardware design issue; you can find the details in Disk I/O Bandwidth and Latency Varies for Ports. With the existing VolumeChoosingPolicy, the disks with higher latency might take the same workload as the disks with lower latency, which slows down the performance. We suggest implementing a latency-aware VolumeChoosingPolicy in HDFS.

Performance Estimation for RAMDISK with 1TB Dataset

We cannot measure the performance of RAMDISK with the 1TB dataset due to RAMDISK's limited capacity. Instead we have to estimate its performance by analyzing the results of the HDD and SSD cases.

The performance of the 1TB and 50GB datasets is pretty close for both HDD and SSD.

The throughput difference between 50GB and 1TB dataset for HDD is

|242801 / 250034 - 1| x 100% = 2.89%

While for SSD the value is

|325148 / 320616 - 1| x 100% = 1.41%

If we take the average of the above values as the throughput difference for RAMDISK between the 50GB and 1TB datasets, it is around 2.15% ((2.89% + 1.41%) / 2 = 2.15%), thus the estimated throughput for RAMDISK with the 1TB dataset is

406577 x (1 + 2.15%) = 415318 (ops/sec)


Figure 11.  YCSB throughput estimation for RAMDISK with 1TB dataset

Please note: the throughput doesn’t drop much in 1 TB dataset cases compared to 50 GB dataset cases because they do not use the same number of pre-split regions. The table is pre-split to 18 regions in 50 GB dataset cases, and it is pre-split to 210 regions in the 1 TB dataset.

Performance for Tiered Storage

In this section, we study the HBase write performance on tiered storage (i.e. different storage types mixed together in one test). This shows what performance can be achieved by mixing fast and slow storage, and helps us find the storage mix with the best balance between performance and cost.

Figure 12 and Figure 13 show the performance for tiered storage. You can find the description of each case in Table 1.

Most of the cases that introduce fast storage have better throughput and latency. Unsurprisingly, 1T_RAM_SSD has the best performance among them. The real surprise is that after introducing RAMDISK the throughput of 1T_RAM_HDD is worse than 1T_HDD (-11%) and 1T_RAM_SSD_All_HDD is worse than 1T_SSD_All_HDD (-2%), and that 1T_SSD is worse than 1T_SSD_HDD (-2%).

Figure 12.  YCSB throughput data for tiered storage

Figure 13.  YCSB latency data for tiered storage

We also investigate how much data is written to different storage types by collecting information from one DataNode.

Figure 14. Distribution of data blocks on each storage of HDFS in one DataNode

As shown in Figure 14, generally more data is written to disk for the test cases with higher throughput. Fast storage accelerates flush and compaction, which leads to more flushes and compactions, and thus more data written to disk. In some RAMDISK-related cases, only the WAL can be written to RAMDISK, and there are 1216 GB of WALs written to one DataNode.

For the tests without SSD (1T_HDD and 1T_RAM_HDD), we purposely limit the number of flush and compaction actions by using fewer flushers and compactors. This is due to the limited IOPS capability of HDD: too many concurrent reads and writes hurt HDD performance, which eventually slows down the overall performance.

DataNode threads can be BLOCKED for up to tens of seconds in 1T_RAM_HDD. We observe this in other cases as well, but it happens most often in 1T_RAM_HDD. This is because each DataNode holds one big lock when creating/finalizing HDFS blocks, and these methods can sometimes take tens of seconds (see Long-time BLOCKED threads in DataNode); the more these methods are used (in HBase they are used by the flushers, the compactors, and the WAL), the more often the blocking occurs. Writing the WAL in HBase needs to create/finalize blocks, which can be blocked, and consequently users' writes are blocked. Multiple WALs with a large number of groups, or a WAL per region, might also encounter this problem, especially on HDD.

With the written data distribution in mind, let’s look back at the performance result in Figure 12 and Figure 13. According to them, we have following observations:

  1. Mixing SSD and HDD can greatly improve the performance (136% throughput and 35% latency) compared to pure HDD. But fully replacing HDD with SSD doesn't show an improvement (98% throughput and 99% latency) over mixing SSD/HDD. This is because the hardware design cannot evenly split the I/O bandwidth across all eight disks, and 94% of the data is written to SSD while only 6% is written to HDD in the SSD/HDD mixing case. This strongly hints that a mixed use of SSD/HDD can achieve the best balance between performance and cost. More information is in Disk Bandwidth Limitation and Disk I/O Bandwidth and Latency Varies for Ports.

  2. Including RAMDISK in the SSD/HDD tiered storage gives different results for 1T_RAM_SSD_All_HDD and 1T_RAM_SSD_HDD. In 1T_RAM_SSD_HDD only a small amount of data is written to HDD, which improves the performance over the SSD/HDD mixing cases. In 1T_RAM_SSD_All_HDD a large amount of data is written to HDD, and the result is worse than the SSD/HDD mixing cases. This means that if we distribute the data appropriately to SSD and HDD in HBase, we can gain good performance by mixing RAMDISK/SSD/HDD.

  3. The RAMDISK/SSD tiered storage is the winner in both throughput and latency (109% throughput and 67% latency of the pure SSD case). So, if cost is not an issue and maximum performance is needed, RAMDISK/SSD should be chosen.

The throughput of 1T_RAM_HDD is 11% lower than that of 1T_HDD. This is initially because 1T_RAM_HDD uses RAMDISK, which consumes part of the RAM and leaves the OS buffer with less memory to cache data.

Further, with 1T_RAM_HDD the YCSB client can push data at very high speed, so cells accumulate very quickly in the memstore while flush and compaction on HDD are slow. RegionTooBusyException occurs more often (the figure below shows a much larger memstore in 1T_RAM_HDD than in 1T_HDD), and we observe much longer GC pauses in 1T_RAM_HDD than in 1T_HDD, up to 20 seconds in a minute.

Figure 15. Memstore size in 1T_RAM_HDD and 1T_HDD
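
The GC pauses were read from the RegionServer GC logs. A minimal sketch of enabling such logging in conf/hbase-env.sh is shown below; the log path is an assumption.

# Illustrative sketch: turn on GC logging for RegionServers in conf/hbase-env.sh.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/hbase/gc-regionserver.log"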

Finally, when we try to increase the number of flushers and compactors, the performance gets even worse, for the reasons mentioned when explaining why we use fewer flushers and compactors in the HDD-related tests (see Long-time BLOCKED threads in DataNode).

The performance reduction of 1T_RAM_SSD_All_HDD compared to 1T_SSD_All_HDD (-2%) is due to the same reasons.

We suggest:

  1. Implement a finer grained lock mechanism in DataNode.

  2. Use reasonable configurations for flushers and compactors, especially in HDD-related cases.

  3. Don't mix storage types that have a large performance gap, such as RAMDISK and HDD directly.

  4. In many cases, we observe long GC pauses of around 10 seconds per minute. We need to implement an off-heap memstore in HBase to solve the long GC pause issue.


Go to part 6, Issues

HDFS HSM and HBase: Experiment (Part 4 of 7)

This is part 4 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel) 

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Experimentation

Performance for Single Storage Type

First, we study the HBase write performance on each single storage type (i.e. no storage types mixed). This test shows the maximum performance each storage type can achieve and provides a guide for the tiered storage analysis that follows.

We study the single-storage-type performance with two datasets: 50GB and 1TB of inserted data. We believe 1TB is a reasonable data size in practice, while 50GB is used to evaluate the upper limit of performance, as HBase performance is typically higher when the data size is small. The 50GB size was chosen because we need to avoid running out of space in the test; further, due to RAMDISK's limited capacity, we have to use a small data size when storing all data in RAMDISK.

50GB Dataset in a Single Storage

The throughput and latency by YCSB for 50GB dataset are listed in Figure 2 and Figure 3. As expected, storing 50GB data in RAMDISK has the best performance whereas storing data in HDD has the worst.

Figure 2. YCSB throughput of a single storage with 50GB dataset

Note: Performance data for SSD and HDD may be better than their real capability as the OS can buffer/cache data in memory.

Figure 3. YCSB latency of a single storage with 50GB dataset

Note: Performance data for SSD and HDD may be better than their real capability as the OS can buffer/cache data in memory.

For throughput, RAMDISK and SSD are higher than HDD (163% and 128% of HDD throughput, respectively), and their average latencies are dramatically lower than HDD (only 11% and 32% of HDD latency). This is expected as HDD has a long seek latency. The latency improvement of RAMDISK and SSD over HDD is bigger than the throughput improvement, which is caused by the huge access-latency gap between the storage types. The latencies we measured for accessing 4KB of raw data from RAM, SSD and HDD are 89ns, 80µs and 6.7ms respectively (RAM and SSD are about 75000x and 84x faster than HDD).

Now consider the data of hardware (disk, network and CPU in each DataNode) collected during the tests.

Disk Throughput


Case    | Theoretical throughput (MB/s) | Measured throughput (MB/s) | Utilization
50G_HDD | 127 x 8 = 1016 | 870  | 86%
50G_SSD | 447 x 8 = 3576 | 1300 | 36%
50G_RAM | >11194         | 1800 | <16%

Table 10. Disk utilization of test cases

Note: The data listed in the table for 50G_HDD and 50G_SSD are the total throughput of 8 disks.

The disk throughput is decided by factors such as access pattern, block size and I/O depth. The theoretical disk throughput is measured with performance-friendly settings; in real cases the settings are less friendly and limit the throughput to lower values. In fact, we observe that the disk utilization usually goes up to 100% for HDD.

Network Throughput

Each DataNode is connected by a 20Gbps (2560MB/s) full duplex network, which means both receive and transmit speeds can reach 2560MB/s simultaneously. In our tests, the receive and transmit throughput are almost identical, so only the receive throughput is listed in Table 11.


Case    | Theoretical throughput (MB/s) | Measured throughput (MB/s) | Utilization
50G_HDD | 2560 | 760  | 30%
50G_SSD | 2560 | 1100 | 43%
50G_RAM | 2560 | 1400 | 55%

Table 11. Network utilization of test cases
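
These network numbers come from OS-level counters. A minimal sketch of how per-interface throughput can be sampled is shown below; the interface name is an assumption.

# Illustrative sketch: sample per-interface receive/transmit throughput every 5 seconds.
sar -n DEV 5
# Read the rxkB/s and txkB/s columns for the bonded interface (e.g. bond0).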

CPU Utilization


Case    | Utilization
50G_HDD | 36%
50G_SSD | 55%
50G_RAM | 60%

Table 12. CPU utilization of test cases

Clearly, the performance of HDD (50G_HDD) is bottlenecked on disk throughput. However, we can see that for SSD and RAMDISK neither disk, network, nor CPU is the bottleneck. So the bottlenecks must be elsewhere: either in software (e.g. HBase compaction, memstore in regions, etc.) or in hardware other than disk, network and CPU.

To further understand why the utilization of SSD is so low and the throughput is not as high as expected (only 132% of HDD, even though SSD has a much higher theoretical throughput: 447 MB/s vs. 127 MB/s per disk), we ran an additional test writing 50GB of data on four SSDs per node instead of eight.

Figure 4. YCSB throughput of a single storage type with 50 GB dataset stored in different number of SSDs

Figure 5. YCSB latency of a single storage type with 50 GB dataset stored in different number of SSDs

We can see that while the number of SSDs doubled, the throughput and latency of the eight SSDs per node case (50G_8SSD) improved to a much lesser extent (104% throughput and 79% latency) compared to the four SSDs per node case (50G_4SSD). This means the capability of the SSDs is far from fully used.

We dove further and found that this is caused by the current mainstream server hardware design. A mainstream server has two SATA controllers which together support up to 12 Gbps of bandwidth. This means that the total disk bandwidth is limited to around 1.5GB/s, which is the bandwidth of approximately 3.4 SSDs. You can find the details in the section Disk Bandwidth Limitation. So the SATA controllers become the bottleneck for eight SSDs per node. This explains why there is almost no improvement in YCSB throughput with eight SSDs per node: in the 50G_8SSD test, the disk is the bottleneck.

Go to part 5, Experiment (continued)

HDFS HSM and HBase: Cluster setup (Part 2 of 7)

This is part 2 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Cluster Setup

In all, five nodes are used in the testing. Figure 1 shows the topology of these nodes; the services on each node are listed in Table 3.

Figure 1. Cluster topology

  1. For HDFS, one node serves as NameNode, three nodes as DataNodes.

  2. For HBase, one HMaster is collocated together with NameNode. Three RegionServers are collocated with DataNodes.

  3. All the nodes in the cluster are connected to the full duplex 10Gbps DELL N4032 Switch. Network bandwidth for client-NameNode, client-DataNode and NameNode-DataNode is 10Gbps. Network bandwidth between DataNodes is 20Gbps (20Gbps is achieved by network bonding).
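
As a rough illustration of the bonding mentioned above (not the exact switch or OS configuration used in the tests), two 10GbE ports can be bonded on CentOS 6 roughly as follows; the bonding mode and option values are assumptions, and the switch ports must be configured to match.

# cat /etc/sysconfig/network-scripts/ifcfg-bond0    (illustrative)
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
# Each physical 10GbE port gets MASTER=bond0 and SLAVE=yes in its own ifcfg-ethX file.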

Node

NameNode

DataNode

HMaster

RegionServer

Zookeeper

YCSB client

1





2




3




4




5






Table 3. Services run on each node

Hardware

The hardware configuration of the nodes in the cluster is listed in the following tables.

Item    | Model/Comment
CPU     | Intel® Xeon® CPU E5-2695 v3 @ 2.3GHz, dual sockets
Memory  | Micron 16GB DDR3-2133MHz, 384GB in total
NIC     | Intel 10-Gigabit X540-AT2
SSD     | Intel S3500 800G
HDD     | Seagate Constellation™ ES ST2000NM0011 2T 7200RPM
RAMDISK | 300GB

Table 4. Hardware for DataNode/RegionServer

Note: The OS is stored on an independent SSD (Intel® SSD DC S3500 240GB) for both DataNodes and NameNode. The number of SSDs or HDDs (OS SSD not included) in a DataNode varies for different test cases. See the section 'Methodology' for details.

Item   | Model/Comment
CPU    | Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz, dual sockets
Memory | Micron 16GB DDR3-2133MHz, 260GB in total
NIC    | Intel 10-Gigabit X540-AT2
SSD    | Intel S3500 800G

Table 5. Hardware for NameNode/HBase Master


Item                         | Value
Intel Hyper-Threading Tech   | On
Intel Virtualization         | Disabled
Intel Turbo Boost Technology | Enabled
Energy Efficient Turbo       | Enabled

Table 6. Processor configuration

Software

Software  | Details
OS        | CentOS release 6.5
Kernel    | 2.6.32-431.el6.x86_64
Hadoop    | 2.6.0
HBase     | 1.0.0
Zookeeper | 3.4.5
YCSB      | 0.3.0
JDK       | jdk1.8.0_60
JVM Heap  | NameNode: 32GB; DataNode: 4GB; HMaster: 4GB; RegionServer: 64GB
GC        | G1GC

Table 7. Software stack version and configuration

NOTE: As mentioned in Methodology, HDFS and HBase have been enhanced to support these tests.

Benchmarks

We use YCSB 0.3.0 as the benchmark and use one YCSB client in the tests.

This is the workload configuration for 1T dataset:

# cat ../workloads/1T_workload
fieldcount=5
fieldlength=200
recordcount=1000000000
operationcount=0
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0
updateproportion=0
scanproportion=0
insertproportion=0
requestdistribution=zipfian

And we use following command to start the YCSB client:

./ycsb load hbase-10 -P ../workloads/1T_workload -threads 200 -p columnfamily=family -p clientbuffering=true -s > 1T_workload.dat

Go to part 3, Tuning

HDFS HSM and HBase: Introduction (Part 1 of 7)

This is part 1 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Introduction

As more and more fast storage types (SSD, NVMe SSD, etc.) emerge, a methodology is needed to use them for better throughput and latency in big data systems. However, these fast storage types are still expensive and limited in capacity. This study provides a guide for cluster setup with different storage media.

In general, this guide considers the following questions:

  1. What is the maximum performance a user can achieve by using fast storage?

  2. Where are the bottlenecks?

  3. How to achieve the best balance between performance and cost?

  4. How to predict what kind of performance a cluster can have with different storage combinations?


In this study, we examine the HBase write performance on different storage media. We leverage the hierarchical storage management support in HDFS to store different categories of HBase data on different media.

Three different types of storage (HDD, SSD and RAMDISK) are evaluated. HDD is the most widely used storage today; SATA SSD is a faster storage type that is becoming more and more popular. RAMDISK is used to emulate extremely high performance PCIe NVMe-based SSDs and upcoming faster storage (e.g. Intel 3D XPoint® based SSDs). Due to hardware unavailability, we have to use RAMDISK for this emulation, and we believe our results hold for PCIe SSDs and other fast storage types.

Note: RAMDISK is a logical device backed by memory. Files stored in RAMDISK are only kept in memory.
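
As a rough illustration of this emulation (not the exact setup used in the report), a RAMDISK can be provided by a tmpfs mount that is then listed as a DataNode data directory; the mount point, size and storage-type tag below are assumptions, and note that the HDFS used in this study was patched to recognize a dedicated RAMDISK storage type.

# Illustrative sketch: back a 300GB RAMDISK with tmpfs on a DataNode.
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=300g tmpfs /mnt/ramdisk
# The directory is then added to dfs.datanode.data.dir with a storage-type tag,
# e.g. [RAM_DISK]/mnt/ramdisk in stock Hadoop.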

Methodology

We test the HBase write performance with tiered storage in HDFS and compare the performance when different categories of HBase data are stored on different storage types. YCSB (Yahoo! Cloud Serving Benchmark, a widely used open source framework for evaluating the performance of data-serving systems) is used as the test workload.

Eleven test cases are evaluated in this study. We pre-split the table into 210 regions in the 1 TB dataset cases to avoid region splits at runtime, and into 18 regions in the 50 GB dataset cases.

The format of case names is <dataset size>_<storage type>.

Case Name | Storage | Dataset | Comment
50G_RAM | RAMDISK | 50GB | Store all the files in RAMDISK. We have to limit the data size to 50GB in this case due to the capacity limitation of RAMDISK.
50G_SSD | 8 SSDs | 50GB | Store all the files in SATA SSD. Compare the performance by 50GB data with the 1st case.
50G_HDD | 8 HDDs | 50GB | Store all the files in HDD.
1T_SSD | 8 SSDs | 1TB | Store all files in SATA SSD. Compare the performance by 1TB data with the cases in tiered storage.
1T_HDD | 8 HDDs | 1TB | Store all files in HDD. Use 1TB in this case.
1T_RAM_SSD | RAMDISK + 8 SSDs | 1TB | Store files in a tiered storage (i.e. different storage types mixed together in one test): WAL is stored in RAMDISK, and all the other files are stored in SSD.
1T_RAM_HDD | RAMDISK + 8 HDDs | 1TB | Store files in a tiered storage: WAL is stored in RAMDISK, and all the other files are stored in HDD.
1T_SSD_HDD | 4 SSDs + 4 HDDs | 1TB | Store files in a tiered storage: WAL is stored in SSD, some smaller files (not larger than 1.5GB) are stored in SSD, and all the other files are stored in HDD including all archived files.
1T_SSD_All_HDD | 4 SSDs + 4 HDDs | 1TB | Store files in a tiered storage: WAL and flushed HFiles are stored in SSD, and all the other files are stored in HDD including all archived files and compacted files.
1T_RAM_SSD_HDD | RAMDISK + 4 SSDs + 4 HDDs | 1TB | Store files in a tiered storage: WAL is stored in RAMDISK, some smaller files (not larger than 1.5GB) are stored in SSD, and all the other files are stored in HDD including all archived files.
1T_RAM_SSD_All_HDD | RAMDISK + 4 SSDs + 4 HDDs | 1TB | Store files in a tiered storage: WAL is stored in RAMDISK, flushed HFiles are stored in SSD, and all the other files are stored in HDD including all archived files and compacted files.

Table 1. Test Cases

NOTE: In all 1TB test cases, we pre-split the HBase table into 210 regions to avoid the region split at runtime.
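
As a hedged sketch, such a pre-split table can be created from the HBase shell as shown below; the table and column family names follow the YCSB defaults used in the Benchmarks section, and the split algorithm is an assumption.

# Illustrative sketch: create the YCSB table pre-split into 210 regions.
echo "create 'usertable', 'family', {NUMREGIONS => 210, SPLITALGO => 'HexStringSplit'}" | hbase shell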

The metrics in Table 2 are collected during the test for performance evaluation.

Level                | Metrics collected
Storage media level  | IOPS, throughput (sequential/random R/W)
OS level             | CPU usage, network IO, disk IO, memory usage
YCSB Benchmark level | Throughput, latency

Table 2. Metrics collected during the tests

Go to Part 2, Cluster Setup

Friday April 22, 2016

HDFS HSM and HBase: Tuning (Part 3 of 7)

This is part 3 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel) 

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Stack Enhancement and Parameter Tuning

Stack Enhancement

To perform the study, we made a set of enhancements in the software stack:

  • HDFS:

    • Support a new storage RAMDISK

    • Add file-level mover support, so that a user can move the blocks of a single file without scanning all the metadata in the NameNode

  • HBase:

    • WAL, flushed HFiles, HFiles generated in compactions, and archived HFiles can be stored in different storage

    • When renaming HFiles across storage, the blocks of that file would be moved to the target storage asynchronously

HDFS/HBase Tuning

This step is to find the best configurations for HDFS and HBase.

Known Key Performance Factors in HBase

These are the key performance factors in HBase:

  1. WAL: the write-ahead log guarantees the durability and consistency of data. Each record inserted into HBase must be written to the WAL, which can slow down user operations. It is latency-sensitive.

  2. Memstore and Flush: The records inserted into HBase are cached in the memstore, and when a threshold is reached the memstore is flushed to a store file. Slow flushes can lead to long GC (Garbage Collection) pauses and make the memory usage reach the thresholds of the regions and the region server, which blocks user operations.

  3. Compaction and Number of Store Files: HBase compaction merges small store files into a larger one, which reduces the number of store files and accelerates reads, but it generates heavy I/O and consumes disk bandwidth at runtime. Less compaction accelerates writes but leaves too many store files, which slows down reads. When there are too many store files, memstore flushes can be slowed down, which leads to a large memstore and further slows down user operations.

Based on this understanding, the following are the tuned parameters we finally used.

Property | Value
dfs.datanode.handler.count | 64
dfs.namenode.handler.count | 100

Table 8. HDFS configuration

Property | Value
hbase.regionserver.thread.compaction.small | 3 for non-SSD test cases; 8 for all SSD-related test cases
hbase.hstore.flusher.count | 5 for non-SSD test cases; 15 for all SSD-related test cases
hbase.wal.regiongrouping.numgroups | 4
hbase.wal.provider | multiwal
hbase.hstore.blockingStoreFiles | 15
hbase.regionserver.handler.count | 200
hbase.hregion.memstore.chunkpool.maxsize | 1
hbase.hregion.memstore.chunkpool.initialsize | 0.5

Table 9. HBase configuration
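
As an illustration of how these settings are applied, a couple of the Table 9 properties would appear in conf/hbase-site.xml as follows (the values shown are the SSD-related variants; this is a sketch, not the full configuration used in the tests).

# cat conf/hbase-site.xml    (excerpt, illustrative)
<property>
  <name>hbase.wal.provider</name>
  <value>multiwal</value>
</property>
<property>
  <name>hbase.wal.regiongrouping.numgroups</name>
  <value>4</value>
</property>
<property>
  <name>hbase.hstore.flusher.count</name>
  <value>15</value>
</property>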

Go to part 4, Experiment
