HBase Performance - 4 Data Block Encoding Types

Data Block Encoding Types

1. Prefix

2. Diff

3. Fast Diff

4. Prefix Tree

Sample HBase data without any encoding

Key Len	Val Len	Key	Value
24	3	RowKey:Family:Qualifier0	abc
24		RowKey:Family:Qualifier1
24		RowKey:Family:QualifierN
25		RowKey1:Family:Qualifier0
25		RowKey1:Family:Qualifier1
25		RowKey1:Family:Qualifier2

Prefix Data Block Encoding

Prefix encoding helps when:

· Keys are somewhat similar

· Keys share a common prefix

In Prefix encoding, an extra column is added which holds the length of the prefix shared between the current key and the previous key. For instance, one key might be RowKey:Family:Qualifier0 and the next key might be RowKey:Family:Qualifier1. Assuming the first key here is totally different from the key before it, its prefix length is 0. The second key's prefix length is 23, since the two keys have their first 23 characters in common. Only the part of each key that follows the shared prefix is stored.

This encoding brings no benefit if consecutive keys in the table share no common prefix.

Key Len	Prefix Len	Val Len	Key	Value
24	0	3	RowKey:Family:Qualifier0	abc
1	23		1
1	23		N
19	6		1:Family:Qualifier0
1	24		1
1	24		2
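The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not HBase's implementation: it assumes keys are plain colon-separated byte strings (real HBase keys are binary structures), and the function names are made up for illustration.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(keys):
    """Return (stored_key_len, prefix_len, stored_key) for each key, in order."""
    previous = b""
    encoded = []
    for key in keys:
        shared = common_prefix_len(previous, key)
        suffix = key[shared:]          # only the non-shared tail is stored
        encoded.append((len(suffix), shared, suffix))
        previous = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the full keys by reusing the prefix of the previous key."""
    previous = b""
    keys = []
    for _stored_len, shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


keys = [
    b"RowKey:Family:Qualifier0",
    b"RowKey:Family:Qualifier1",
    b"RowKey:Family:QualifierN",
    b"RowKey1:Family:Qualifier0",
    b"RowKey1:Family:Qualifier1",
    b"RowKey1:Family:Qualifier2",
]
encoded = prefix_encode(keys)
for stored_len, prefix_len, suffix in encoded:
    print(stored_len, prefix_len, suffix.decode())
```

Note that decoding a key needs the key before it, which is why a reader has to walk the block sequentially to reach a given cell.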

Diff Data Block Encoding

Diff encoding expands upon Prefix encoding. Instead of considering the key sequentially as a monolithic series of bytes, each key field is split so that each part of the key can be compressed more efficiently.

Two new fields are added:

· Timestamp

· Type

If the ColumnFamily is the same as the previous row, it is omitted from the current row.

If the key length, value length or types are the same as the previous row, the field is omitted.

In addition, for increased compression, the timestamp is stored as a diff from the previous row’s timestamp, rather than being stored in full. Given the two row keys in the Prefix example, and given an exact match on timestamp and the same type, neither the value length nor the type needs to be stored for the second row, and the timestamp value for the second row is just 0, rather than a full timestamp.

Diff encoding is disabled by default because writing and scanning are slower, although more data fits in the cache.

Flags	Key Len	Prefix Len	Val Len	Key	Timestamp	Type	Value
0	24	0	512	RowKey:Family:Qualifier0	130466835	4	abc
5		23	320	1	0
3		23		N	120	8
0	25	6	576	1:Family:Qualifier0	25	4
5		24	384	1	1124
		24		2
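To make the "omit what did not change" rule concrete, here is a rough Python sketch. It is a simplification under several assumptions: the row, family and qualifier are flattened into a single key, each encoded cell is a dict rather than a packed byte layout, and the timestamp used is arbitrary. It does not try to reproduce the flag values in the table above, and the names (Cell, diff_encode, diff_decode) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    key: bytes        # row, family and qualifier flattened (a simplification)
    timestamp: int
    cell_type: int
    value: bytes


def shared_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def diff_encode(cells):
    """Store each cell as a dict holding only what differs from the previous cell."""
    encoded, prev = [], None
    for cell in cells:
        if prev is None:
            # First cell of the block: nothing to diff against, store everything.
            entry = {"key_len": len(cell.key), "prefix_len": 0,
                     "val_len": len(cell.value), "key": cell.key,
                     "timestamp": cell.timestamp, "type": cell.cell_type,
                     "value": cell.value}
        else:
            shared = shared_prefix_len(prev.key, cell.key)
            entry = {"prefix_len": shared, "key": cell.key[shared:],
                     # timestamp is a delta from the previous cell, not a full value
                     "timestamp": cell.timestamp - prev.timestamp,
                     "value": cell.value}
            if len(cell.key) != len(prev.key):
                entry["key_len"] = len(cell.key)
            if len(cell.value) != len(prev.value):
                entry["val_len"] = len(cell.value)
            if cell.cell_type != prev.cell_type:
                entry["type"] = cell.cell_type
        encoded.append(entry)
        prev = cell
    return encoded


def diff_decode(encoded):
    """Rebuild the cells by applying each entry on top of the previous cell."""
    cells, prev = [], None
    for entry in encoded:
        if prev is None:
            cell = Cell(entry["key"], entry["timestamp"], entry["type"], entry["value"])
        else:
            cell = Cell(prev.key[:entry["prefix_len"]] + entry["key"],
                        prev.timestamp + entry["timestamp"],
                        entry.get("type", prev.cell_type),
                        entry["value"])
        cells.append(cell)
        prev = cell
    return cells


# Two cells with the same (arbitrary) timestamp and the same type code.
cells = [
    Cell(b"RowKey:Family:Qualifier0", 1304668350, 4, b"abc"),
    Cell(b"RowKey:Family:Qualifier1", 1304668350, 4, b"xyz"),
]
encoded = diff_encode(cells)
print(encoded[1])
```

For the second cell only the prefix length, the one-byte key suffix, a timestamp delta of 0 and the value remain, which matches the description in the text: key length, value length and type are all dropped because they equal those of the previous cell.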

Fast Diff

Fast Diff works similarly to Diff, but uses a faster implementation. It also adds another field which stores a single bit to track whether the data itself is the same as the previous row. If it is, the data is not stored again.

Fast Diff is the recommended codec to use if you have long keys or many columns.

The data format is nearly identical to Diff encoding, so there is not an image to illustrate it.
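The extra "same data" bit is easy to picture in code. The sketch below looks at values only and assumes a tuple (same_as_previous, value) stands in for the real bit plus payload; the function names are invented for this example and do not come from HBase.

```python
def encode_values(values):
    """Replace a value by a 'same as previous' marker when it repeats."""
    encoded = []
    prev = None
    for value in values:
        if prev is not None and value == prev:
            encoded.append((True, None))    # bit set, value not stored again
        else:
            encoded.append((False, value))  # bit clear, value stored in full
        prev = value
    return encoded


def decode_values(encoded):
    """Restore the original values by copying the previous one when flagged."""
    values = []
    prev = None
    for same_as_previous, value in encoded:
        if same_as_previous:
            value = prev
        values.append(value)
        prev = value
    return values


values = [b"abc", b"abc", b"xyz", b"xyz", b"abc"]
encoded = encode_values(values)
print(encoded)
```

Only consecutive repeats are collapsed: the last b"abc" is stored again because the cell before it holds b"xyz".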

Prefix Tree

Prefix tree encoding was introduced as an experimental feature in HBase 0.96. It provides similar memory savings to the Prefix, Diff, and Fast Diff encoders, but provides faster random access at the cost of slower encoding speed.

Prefix Tree may be appropriate for applications that have high block cache hit ratios. It introduces new 'tree' fields for the row and column. The row tree field contains a list of offsets/references corresponding to the cells in that row. This allows for a good deal of compression. For more details about Prefix Tree encoding, see HBASE-4676.

It is difficult to graphically illustrate a prefix tree, so no image is included. See the Wikipedia article for Trie for more general information about this data structure.
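Even without a picture, a tiny trie in Python shows the general idea. This is a plain in-memory trie, not the serialized format HBase uses; the cell_offsets list is a hypothetical stand-in for the "offsets/references corresponding to the cells in that row" mentioned above.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next byte -> TrieNode
        self.cell_offsets = []  # hypothetical: positions of the cells of a row ending here


def insert(root, row: bytes, offset: int) -> None:
    """Walk or create one node per byte of the row, then record the cell offset."""
    node = root
    for byte in row:
        node = node.children.setdefault(byte, TrieNode())
    node.cell_offsets.append(offset)


def lookup(root, row: bytes):
    """Follow the row byte by byte; no earlier rows need to be decoded first."""
    node = root
    for byte in row:
        node = node.children.get(byte)
        if node is None:
            return []
    return node.cell_offsets


def count_nodes(node) -> int:
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = TrieNode()
# Cells 0-2 belong to row "RowKey", cells 3-5 to row "RowKey1".
for offset, row in enumerate([b"RowKey"] * 3 + [b"RowKey1"] * 3):
    insert(root, row, offset)

print(lookup(root, b"RowKey1"))
print(count_nodes(root))
```

The two rows share the six nodes for "RowKey", so the common prefix is stored once, and a lookup walks straight down the tree instead of replaying every previous key as the Prefix and Diff sketches must.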
