HBase Performance - 1


Some Key Points Regarding Performance

Point - 1
HBase is a column-oriented datastore whose tables are sorted by row key. It is designed to store denormalized data.
Each row is stored as:
<Key> -- <Many Columns> {each cell value in the table has a timestamp}
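
As a rough illustration of this layout, here is a minimal in-memory sketch in Python. It is a toy model and not the HBase client API; the names table, put and get_latest, and the sample row keys and columns, are made up for the example.

```python
from collections import defaultdict

# Toy model of the layout described above (not the HBase API):
# row key -> column name -> {timestamp: value}
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, column, value, timestamp):
    """Store one cell version; each (row, column) keeps its values by timestamp."""
    table[row_key][column][timestamp] = value

def get_latest(row_key, column):
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(column, {})
    if not versions:
        return None
    return versions[max(versions)]

put("user#002", "info:city", "Pune", timestamp=1)
put("user#001", "info:city", "Delhi", timestamp=1)
put("user#001", "info:city", "Mumbai", timestamp=2)

# Rows are read back in sorted row-key order, as in an HBase table.
sorted_rows = sorted(table)
```

Here get_latest("user#001", "info:city") returns the value written with the newer timestamp, and sorted_rows lists the keys in sorted order regardless of insertion order.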

Point - 2 (HBase Purpose)
If most of your queries scan many rows or the whole table, you may be using HBase for the wrong purpose. Revisit that choice so that it matches your requirement.

Point - 3 (RowKey Design)
Design the RowKey so that you can apply a filter and query only a subset of RowKeys. If your queries use no filter at all, your RowKey design may be wrong.
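
To see why a filterable RowKey matters, the sketch below imitates a prefix scan over sorted keys in plain Python. It is only a toy model, not HBase code; the key format "<customer_id>#<order_date>" and the function prefix_scan are assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical row keys of the form "<customer_id>#<order_date>", kept sorted
# the way HBase keeps rows sorted by key.
row_keys = sorted([
    "cust001#2023-01-05",
    "cust001#2023-02-11",
    "cust002#2023-01-20",
    "cust003#2023-03-02",
])

def prefix_scan(sorted_keys, prefix):
    """Return keys starting with `prefix` by seeking to the first candidate
    and reading forward until the prefix no longer matches."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result

matches = prefix_scan(row_keys, "cust001#")
```

Because the keys are sorted and the customer id comes first, the scan touches only the two cust001 rows. If the key started with the date instead, the same query would have to look at every row.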

Point - 4 (Column Family)
HBase faces performance issues with more than 2 or 3 column families.
If you need more logical groupings, you can keep a single column family and give the columns different prefixes.
HBase is sparse, so this does not take more space, and you can still read just one "family" with a ColumnPrefixFilter on scans if you need to.

Sparse means that when a row is created, storage for a column is not allocated unless that column has a value.
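
The prefix idea and the sparse storage can be pictured with the small Python sketch below. It is a toy model rather than the HBase API; the prefixes "addr_" and "pay_" and the helper names are hypothetical.

```python
# Toy model (not the HBase API): one column family, with hypothetical
# prefixes "addr_" and "pay_" standing in for what could have been two families.
rows = {
    "row1": {"addr_city": "Pune", "addr_zip": "411001", "pay_card": "visa"},
    "row2": {"pay_card": "amex"},  # no addr_* columns are stored for this row
}

def columns_with_prefix(row, prefix):
    """Keep only the columns whose name starts with `prefix`."""
    return {col: val for col, val in row.items() if col.startswith(prefix)}

def stored_cells(all_rows):
    """Count the cells that actually exist; absent columns take no space."""
    return sum(len(cols) for cols in all_rows.values())

addr_only = {key: columns_with_prefix(cols, "addr_") for key, cols in rows.items()}
total_cells = stored_cells(rows)
```

Only four cells exist in total, even though three distinct columns are defined across two rows, and filtering on "addr_" returns an empty result for row2 instead of empty placeholders.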


Point - 5 (Versions)
Consider an HBase table that has 30 columns in a single column family:
  create 'my_table', { NAME => 'my_family', VERSIONS => 5 }
Suppose we want to increase the number of versions to 10,000. Since the table already exists, this is an alter rather than a second create:
  alter 'my_table', { NAME => 'my_family', VERSIONS => 10000 }
Changing VERSIONS to 10K applies to every column in the family, but the requirement is to change it for only 2 columns.
What is the performance impact in each of the two cases?
  1. Make two different column families and set the versions accordingly
  2. Change the versions for all columns

Creating a separate column family is the better option. Preserving unnecessary versions for the other 28 columns adversely affects performance because the HStore files grow. The larger data size increases the number of regions, which in turn increases the number of mappers per region server.
With two column families, the store files do not hold the unnecessary data, which means fewer splits and less work during compaction, so I/O performance improves.
If there are two column families A and B, where A has 1 million rows and B has 1 billion, the data of A is spread across many regions and region servers. This makes mass scans of column family A less efficient.
Regions are distributed by RowKey, so even if A has only 1 million rows, well distributed across the RowKeys, you may still need to scan all of those regions. I don't think this will have much impact, but it can only be avoided by using a different table for the two high-version columns.
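
The difference between the two options can be put into rough numbers. The Python sketch below computes only an upper bound on retained cell versions per row, assuming every column is updated often enough to fill its version limit; real store file sizes also depend on value sizes, compression and how often each column actually changes. The function name is made up for the example.

```python
def max_versions_per_row(families):
    """Upper bound on retained cell versions per row.
    `families` is a list of (column_count, max_versions) pairs."""
    return sum(columns * versions for columns, versions in families)

# Option 2 in the text: one family, all 30 columns keep up to 10,000 versions.
single_family = max_versions_per_row([(30, 10_000)])

# Option 1 in the text: 2 columns in a high-version family,
# the other 28 columns stay in a family with VERSIONS => 5.
split_families = max_versions_per_row([(2, 10_000), (28, 5)])
```

Under these assumptions the single family can retain up to 300,000 cell versions per row, against 20,140 with the split, which is the extra data the answer above refers to.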

