Apache HBase Region Splitting and Merging

HBase Table -

  • A table is stored as a number of regions: the table is split into chunks of rows, and each chunk is a region
  • Regions are distributed across the cluster and are made accessible to client processes by the region server process
  • A region is a contiguous range within the key space, meaning all rows in a region are sorted between the region's start key and end key
  • Regions do not overlap: a given row key belongs to exactly one region
  • Together with the -ROOT- and .META. regions, a table’s regions effectively form a 3-level B-Tree for the purpose of locating a row within a table
  • A region in turn consists of Stores, which correspond to column families
  • A Store consists of one memstore and zero or more store files
  • The data for each column family is stored and accessed separately

Table --split into--> Regions --> Stores (one per column family) --> [ memstore, zero or more store files ]
-ROOT- --> .META. --> table regions
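Because regions are sorted, non-overlapping key ranges, finding the region for a row reduces to a search over the region start keys. The following is a minimal sketch of that idea only; the start keys and function names are made up for illustration, and it does not model the -ROOT- and .META. lookups themselves.

```python
from bisect import bisect_right

# Hypothetical region start keys for one table, kept in sorted order.
# The first region starts at the empty key; each region ends where the next begins.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]


def locate_region(row_key: bytes) -> int:
    """Return the index of the region whose [start, end) range holds row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1


def region_range(index: int):
    """Return (start_key, end_key); an empty end key means 'no upper bound'."""
    start = REGION_START_KEYS[index]
    is_last = index + 1 == len(REGION_START_KEYS)
    end = b"" if is_last else REGION_START_KEYS[index + 1]
    return start, end
```

A row key equal to a start key falls in the region that begins there, since the start key is inclusive and the end key is exclusive.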


Pre-Splitting -

When a table is created, HBase gives it only a single region, because it has no way of knowing where the split points should fall. That decision depends heavily on the key distribution of the data. Rather than guessing, HBase leaves the choice to you and provides tools to manage it from the client.
With a process called pre-splitting, you can create a table with multiple regions by supplying the split points at table creation time. This only works if you know the key distribution beforehand.

For pre-splitting you can use the RegionSplitter utility, which takes a pluggable split algorithm.
The utility offers two algorithms: HexStringSplit and UniformSplit. HexStringSplit is meant for row keys that are hexadecimal strings, e.g. when you use hashes as key prefixes.
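To make the idea of hex split points concrete, here is a small sketch that divides a fixed-width hexadecimal key space into evenly sized regions. It is an illustration under assumptions (an 8-character key width and a hypothetical function name), not the RegionSplitter code, and the exact boundaries the real utility produces may differ.

```python
def hex_split_points(num_regions: int, width: int = 8) -> list:
    """Return num_regions - 1 evenly spaced split keys over a hex key space."""
    if num_regions < 2:
        return []  # a single region needs no split points
    key_space = 16 ** width  # number of distinct hex keys of this width
    return [
        format(i * key_space // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]
```

For example, `hex_split_points(4)` yields three split keys, which would pre-split a table into four regions of roughly equal key range.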


Auto Splitting

When a region grows past a certain size, it is automatically split into two regions.
You can configure when a region is split through the pluggable RegionSplitPolicy API.
There are several region split policies:
  1. ConstantSizeRegionSplitPolicy - the default before 0.94. Splits a region when the total data size of one of its stores grows past a limit set by "hbase.hregion.max.filesize" (default 10 GB).
  2. IncreasingToUpperBoundRegionSplitPolicy - the default in 0.94. Splits when a store grows past Min(R^2 * "hbase.hregion.memstore.flush.size", "hbase.hregion.max.filesize"), where R is the number of regions of the same table on the same region server.
  3. KeyPrefixRegionSplitPolicy - picks split points based on row key prefixes, so that rows sharing a prefix stay in the same region.
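The formula for IncreasingToUpperBoundRegionSplitPolicy is easier to follow with numbers. The sketch below evaluates it directly; the 128 MB flush size and 10 GB maximum file size are assumed values for the two properties, and the function name is made up for illustration.

```python
MB = 1024 * 1024
GB = 1024 * MB


def split_size(regions_on_server: int,
               flush_size: int = 128 * MB,     # assumed hbase.hregion.memstore.flush.size
               max_filesize: int = 10 * GB):   # assumed hbase.hregion.max.filesize
    """Min(R^2 * flush size, max file size), with R = regions of this table on this server."""
    return min(regions_on_server ** 2 * flush_size, max_filesize)
```

With these assumed values the split threshold grows as 128 MB, 512 MB, 1152 MB, and so on, and from 9 regions onward it is capped at 10 GB.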

Forced Split

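You can also force a split from the HBase shell by passing the region (here identified by its encoded name) and an explicit split key:
hbase(main):024:0> split 'b07d0034cbe72cb040ae9cf66300a10c', 'b'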


Region Server Implementation

When a write request is sent to the region server, it is accumulated in an in-memory store called the memstore. When the memstore fills up, its contents are written to a store file; this is called a memstore flush.
As store files accumulate, the region server compacts them into fewer, larger files. After a flush or compaction, a region split request is enqueued if the split policy decides the region should be split -
--Flush --> compact --> region split request is enqueued on the region server
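The flush, compact, and split-request sequence can be traced with a toy model. This is only a sketch: the class, its thresholds (counted in rows rather than bytes), and the queue are all hypothetical stand-ins, and real region servers handle this asynchronously and with far more detail.

```python
from collections import deque


class ToyRegion:
    """Toy model of one store in one region; all thresholds are made-up numbers."""

    def __init__(self, flush_size=4, compact_at=3, split_size=10):
        self.flush_size = flush_size   # memstore entries that trigger a flush
        self.compact_at = compact_at   # store file count that triggers a compaction
        self.split_size = split_size   # total store size that triggers a split request
        self.memstore = []
        self.store_files = []          # each store file is a sorted list of keys
        self.split_queue = deque()     # stands in for the region server's split queue
        self.events = []

    def put(self, key):
        # Writes land in the memstore first.
        self.memstore.append(key)
        if len(self.memstore) >= self.flush_size:
            self._flush()

    def _flush(self):
        # A memstore flush writes the in-memory data out as a new store file.
        self.store_files.append(sorted(self.memstore))
        self.memstore = []
        self.events.append("flush")
        if len(self.store_files) >= self.compact_at:
            self._compact()
        self._maybe_request_split()

    def _compact(self):
        # Compaction merges the store files into one bigger file.
        merged = sorted(k for f in self.store_files for k in f)
        self.store_files = [merged]
        self.events.append("compact")

    def _maybe_request_split(self):
        # After a flush or compaction, enqueue a split request if the store is too big.
        store_size = sum(len(f) for f in self.store_files)
        if store_size >= self.split_size and not self.split_queue:
            self.split_queue.append("split-request")
            self.events.append("split-request")
```

Feeding the model twelve puts with the default toy thresholds produces three flushes, one compaction, and then a single queued split request, which mirrors the order described above.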
