HBase Performance - 4 Data Block Encoding Types

Data Block Encoding Types

1.    Prefix

2.    Diff

3.    Fast Diff

4.    Prefix Tree

 

Sample HBase data without any encoding

| Key Len | Val Len | Key | Value |
|---|---|---|---|
| 24 | 3 | RowKey:Family:Qualifier0 | abc |
| 24 |  | RowKey:Family:Qualifier1 |  |
| 24 |  | RowKey:Family:QualifierN |  |
| 25 |  | RowKey1:Family:Qualifier0 |  |
| 25 |  | RowKey1:Family:Qualifier1 |  |
| 25 |  | RowKey1:Family:Qualifier2 |  |

Prefix Data Block Encoding

Prefix encoding is useful when:

·      Keys are somewhat similar

·      Keys share a common prefix

In this encoding, an extra column is added that holds the length of the prefix shared between the current key and the previous key.

This encoding provides no benefit if the keys in the table share no common prefix with the key before them.

 

For instance, one key might be RowKey:Family:Qualifier0 and the next key might be RowKey:Family:Qualifier1. Assuming the first key is totally different from the key before it, its prefix length is 0. The second key's prefix length is 23, since the two keys have their first 23 characters in common.
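The prefix-length bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not HBase's encoder: it treats each key as a plain string (real HBase keys are byte arrays with extra length and timestamp fields), and the function names `prefix_encode` and `prefix_decode` are made up for this example.

```python
def shared_prefix_len(a, b):
    """Count the leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(keys):
    """Return one (key_len, prefix_len, suffix) tuple per key.

    key_len is the length of the stored suffix only, and prefix_len is the
    number of characters shared with the previous key.
    """
    rows = []
    previous = ""
    for key in keys:
        prefix_len = shared_prefix_len(previous, key)
        suffix = key[prefix_len:]
        rows.append((len(suffix), prefix_len, suffix))
        previous = key
    return rows


def prefix_decode(rows):
    """Rebuild the full keys by reusing the prefix of the previous key."""
    keys = []
    previous = ""
    for _key_len, prefix_len, suffix in rows:
        key = previous[:prefix_len] + suffix
        keys.append(key)
        previous = key
    return keys


keys = [
    "RowKey:Family:Qualifier0",
    "RowKey:Family:Qualifier1",
    "RowKey1:Family:Qualifier0",
    "RowKey1:Family:Qualifier1",
]
encoded = prefix_encode(keys)
for row in encoded:
    print(row)
```

Because every key is rebuilt from the one before it, decoding has to walk the keys in order; this is the price paid for storing only the suffix.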

 

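| Key Len | Prefix Len | Val Len | Key | Value |
|---|---|---|---|---|
| 24 | 0 | 3 | RowKey:Family:Qualifier0 | abc |
| 1 | 23 |  | 1 |  |
| 1 | 23 |  | N |  |
| 19 | 6 |  | 1:Family:Qualifier0 |  |
| 1 | 24 |  | 1 |  |
| 1 | 24 |  | 2 |  |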

Diff Data Block Encoding

Diff encoding expands upon Prefix encoding. Instead of treating the key as a monolithic series of bytes, each key field is split out so that each part can be compressed more efficiently.

 

Two new fields are added:

·      Timestamp

·      Type.

If the ColumnFamily is the same as the previous row, it is omitted from the current row.

If the key length, value length or type is the same as in the previous row, the field is omitted.

In addition, for increased compression, the timestamp is stored as a diff from the previous row's timestamp, rather than being stored in full. Given the two row keys in the Prefix example, and given an exact match on timestamp and the same type, neither the value length nor the type needs to be stored for the second row, and the timestamp value for the second row is just 0, rather than a full timestamp.

Diff encoding is disabled by default because writing and scanning are slower, although more data can be cached.
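The field-omission and timestamp-delta rules can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not HBase's actual codec: cells are plain Python objects, the column family check is not modelled, the flag bit values (1, 2, 4) and the names `Cell` and `diff_encode` are chosen for this example, and the timestamp is an arbitrary number.

```python
from dataclasses import dataclass

# Flag bits, assumed for this sketch: a set bit means the field is omitted
# because it matches the previous cell.
SAME_KEY_LEN = 1
SAME_VAL_LEN = 2
SAME_TYPE = 4


@dataclass
class Cell:
    key: str        # "row:family:qualifier", flattened to one string here
    timestamp: int
    cell_type: int
    value: str


def shared_prefix_len(a, b):
    """Count the leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def diff_encode(cells):
    """Encode each cell as a dict holding only the fields that must be stored."""
    rows = []
    prev = None
    for cell in cells:
        prefix_len = shared_prefix_len(prev.key, cell.key) if prev is not None else 0
        row = {"flags": 0, "prefix_len": prefix_len, "key": cell.key[prefix_len:]}

        if prev is not None and len(cell.key) == len(prev.key):
            row["flags"] |= SAME_KEY_LEN
        else:
            row["key_len"] = len(cell.key)

        if prev is not None and len(cell.value) == len(prev.value):
            row["flags"] |= SAME_VAL_LEN
        else:
            row["val_len"] = len(cell.value)

        if prev is not None and cell.cell_type == prev.cell_type:
            row["flags"] |= SAME_TYPE
        else:
            row["type"] = cell.cell_type

        # The first cell stores its full timestamp; later cells store a delta.
        if prev is not None:
            row["timestamp"] = cell.timestamp - prev.timestamp
        else:
            row["timestamp"] = cell.timestamp
        row["value"] = cell.value
        rows.append(row)
        prev = cell
    return rows


cells = [
    Cell("RowKey:Family:Qualifier0", 1340466835163, 4, "abc"),
    Cell("RowKey:Family:Qualifier1", 1340466835163, 4, "xyz"),
]
first, second = diff_encode(cells)
print(first)
print(second)
```

For the second cell, only the flags, the prefix length, the one-character key suffix, a timestamp delta of 0 and the value remain, which matches the case described in the text.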

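| Flags | Key Len | Prefix Len | Val Len | Key | Timestamp | Type | Value |
|---|---|---|---|---|---|---|---|
| 0 | 24 | 0 | 512 | RowKey:Family:Qualifier0 | 130466835 | 4 | abc |
| 5 |  | 23 | 320 | 1 | 0 |  |  |
| 3 |  | 23 |  | N | 120 | 8 |  |
| 0 | 25 | 6 | 576 | 1:Family:Qualifier0 | 25 | 4 |  |
| 5 |  | 24 | 384 | 1 | 1124 |  |  |
|  |  | 24 |  | 2 |  |  |  |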

Fast Diff

Fast Diff works similarly to Diff, but uses a faster implementation. It also adds another field which stores a single bit to track whether the data itself is the same as the previous row. If it is, the data is not stored again.

Fast Diff is the recommended codec to use if you have long keys or many columns.

The data format is nearly identical to Diff encoding, so there is not an image to illustrate it.
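The extra "same value" bit can be sketched on its own in Python. This covers only that one idea, not the rest of the Fast Diff format, and the function names `encode_values` and `decode_values` are hypothetical.

```python
def encode_values(values):
    """Return (same_as_previous, stored_value) pairs.

    When a value equals the previous one, only the flag is kept and the
    value itself is not stored again.
    """
    rows = []
    previous = None
    for value in values:
        if previous is not None and value == previous:
            rows.append((True, None))
        else:
            rows.append((False, value))
        previous = value
    return rows


def decode_values(rows):
    """Rebuild the values by repeating the previous one when the flag is set."""
    values = []
    previous = None
    for same_as_previous, stored in rows:
        value = previous if same_as_previous else stored
        values.append(value)
        previous = value
    return values


values = ["abc", "abc", "abc", "xyz", "xyz"]
encoded_values = encode_values(values)
print(encoded_values)
```

The saving grows with the size of the repeated value, since a run of identical values costs one stored copy plus one bit per repeat.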

Prefix Tree

Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It provides similar memory savings to the Prefix, Diff, and Fast Diff encoders, but provides faster random access at the cost of slower encoding speed.

Prefix Tree may be appropriate for applications that have high block cache hit ratios. It introduces new 'tree' fields for the row and column. The row tree field contains a list of offsets/references corresponding to the cells in that row. This allows for a good deal of compression. For more details about Prefix Tree encoding, see HBASE-4676.

It is difficult to graphically illustrate a prefix tree, so no image is included. See the Wikipedia article for Trie for more general information about this data structure.
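As a rough textual substitute for a picture, here is a minimal character-level trie in Python in which each row key ends at a node holding a list of cell offsets. It only illustrates the general data structure; the actual Prefix Tree block format described in HBASE-4676 is different and far more compact, and the names `TrieNode`, `insert`, `lookup` and `count_nodes` are invented for this sketch.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next character -> TrieNode
        self.cell_offsets = []  # offsets of cells whose row key ends at this node


def insert(root, row_key, cell_offset):
    """Walk or create the path for row_key and record the cell offset there."""
    node = root
    for ch in row_key:
        node = node.children.setdefault(ch, TrieNode())
    node.cell_offsets.append(cell_offset)


def lookup(root, row_key):
    """Return the cell offsets stored for row_key, or an empty list."""
    node = root
    for ch in row_key:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.cell_offsets


def count_nodes(node):
    """Count the nodes below the given node (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


root = TrieNode()
row_keys = ["RowKey", "RowKey", "RowKey1", "RowKey1", "RowKey1"]
for offset, row_key in enumerate(row_keys):
    insert(root, row_key, offset)

print(lookup(root, "RowKey"))   # offsets of the cells in row "RowKey"
print(lookup(root, "RowKey1"))  # offsets of the cells in row "RowKey1"
print(count_nodes(root))        # characters stored once, however many cells share them
```

Five cells across two rows need only seven character nodes here, because the shared prefix "RowKey" is stored once, and a lookup follows one path from the root instead of scanning the cells in order.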

 
