HBase Performance

Some Key Points Regarding Performance

Point-1

HBase is a column-oriented datastore whose tables are sorted by row key. It is designed to store denormalized data.

Data is stored as:

<Key> -- <Many Columns> {each cell value in the table has a timestamp}
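The layout above can be pictured as a sorted map of maps. What follows is a toy model in plain Python, not the HBase client API. The names ToyTable, put, get and sorted_row_keys are made up for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        # Each cell value is stored under its own timestamp (one version).
        self.rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        # Return the value with the newest timestamp, or None if absent.
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def sorted_row_keys(self):
        # Rows are exposed in sorted order of their row key.
        return sorted(self.rows)
```

Writing the same column twice with different timestamps keeps both values, and a read returns the newest one.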

Point-2 (HBase Purpose)

If you are scanning many rows or the whole table, you might be using HBase for the wrong purpose. Please change your design as per your requirement.

Point-3 (RowKey Design)

The RowKey should be designed so that you can apply a filter and query only a subset of RowKeys. If you are not using any filter, your RowKey design may be wrong.
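Because rows are sorted by key, a well chosen key prefix lets a query touch only a contiguous slice of rows. Here is a minimal sketch of that idea in plain Python. The key layout "customer#date" and the function names make_row_key and prefix_scan are assumptions for illustration, not part of HBase.

```python
import bisect


def make_row_key(customer_id, date):
    # Hypothetical composite key: the part you filter on goes first.
    return f"{customer_id}#{date}"


def prefix_scan(sorted_keys, prefix):
    """Return the keys that start with `prefix` from a sorted list of keys."""
    # Keys sharing a prefix are contiguous in sorted order,
    # so jump to the first candidate and stop at the first mismatch.
    start = bisect.bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]
```

With keys such as cust01#2024-01-01, cust01#2024-02-01 and cust02#2024-01-15, a scan for the prefix cust01# reads only the first two rows and never looks at the rest of the table.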

Point-4 (Column Family)

HBase runs into performance issues with more than 2 or 3 column families.

If you need more groupings than that, you can get them by using columns with different prefixes inside a single column family.

HBase is sparse, so this won't take more space, and you can still read just one "family" with a column prefix filter on scans if you need to.

Sparse means that when a row is created, storage is not allocated for a column unless that column has a value.
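The two ideas above, sparse storage and prefixes acting as pseudo families, can be sketched with plain dictionaries. This is a toy illustration in Python, not HBase code. The column names and the helpers count_stored_cells and columns_with_prefix are hypothetical.

```python
def count_stored_cells(rows):
    """Count only the cells that actually hold a value."""
    return sum(len(columns) for columns in rows.values())


def columns_with_prefix(columns, prefix):
    """Keep only the columns whose name starts with `prefix`."""
    return {name: value for name, value in columns.items()
            if name.startswith(prefix)}


# One column family, two logical groups told apart by prefix.
rows = {
    "row1": {"addr_city": "Pune", "addr_zip": "411001", "pref_lang": "en"},
    "row2": {"pref_lang": "hi"},  # no addr_* columns, so nothing is stored for them
}
```

Here only 4 cells are stored rather than 2 rows times 3 columns, and filtering row1 on the prefix addr_ returns just the address columns, which is roughly what a column prefix filter does on a scan.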

Point-5 (Versions)

Consider an HBase table that contains 30 columns in a single column family:


create 'my_table', { NAME => 'my_family', VERSIONS => 5 }

Now suppose we want to increase the number of versions to 10,000. The table already exists, so this is done with alter rather than create:


alter 'my_table', { NAME => 'my_family', VERSIONS => 10000 }

When the versions are changed to 10K, the change applies to all columns in the family, but the requirement is to change it for only 2 columns.

What will the performance impact be in each of these two cases?

1. Create two different column families and set the versions for each accordingly
2. Change the versions for all columns

Creating a separate column family is the better choice. Preserving unnecessary versions for the other 28 columns adversely affects performance because the HStore files grow. The larger HBase data size increases the number of regions, which in turn increases the number of mappers per region server.

With two column families, the store files do not hold the unnecessary data, which means fewer splits and less work during compaction. IO performance improves.
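A rough worst-case count makes the difference concrete. This sketch assumes every column is written often enough to fill its version limit, which real data may never do, so the numbers are an upper bound and not a measurement. The function name max_cells_per_row is made up for illustration.

```python
def max_cells_per_row(families):
    """Upper bound on retained cells per row.

    `families` is a list of (column_count, max_versions) pairs,
    one pair per column family.
    """
    return sum(columns * versions for columns, versions in families)


# Case A: one family, all 30 columns keep up to 10,000 versions.
single_family = max_cells_per_row([(30, 10_000)])

# Case B: 2 columns keep 10,000 versions, the other 28 keep 5.
split_families = max_cells_per_row([(2, 10_000), (28, 5)])
```

Under that assumption, the single family can retain up to 300,000 cells per row, while the split design tops out at 20,140.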

If there are two column families A and B, and the cardinality of A is 1 million rows while B is 1 billion, the data of A is spread across many regions and region servers. This makes mass scans of column family A less efficient.

Regions are distributed by row key, so even if A has only 1 million rows, with a good distribution across row keys you may need to scan all of those regions. I don't think that will have much impact, but it can only be avoided by using a different table for these two high-version columns.
