Data Block Encoding Types
1. Prefix
2. Diff
3. FastDiff
4. Prefix Tree
Sample HBase Data without any Encoding used
| Key Len | Val Len | Key | Value |
| 24 | 3 | RowKey:Family:Qualifier0 | abc |
| 24 | | RowKey:Family:Qualifier1 | |
| 24 | | RowKey:Family:QualifierN | |
| 25 | | RowKey1:Family:Qualifier0 | |
| 25 | | RowKey1:Family:Qualifier1 | |
| 25 | | RowKey1:Family:Qualifier2 | |
Prefix Data Block Encoding –
· Keys are often very similar
· Consecutive keys share a common prefix
Prefix encoding adds an extra column that holds the length of the prefix shared between the current key and the previous key. It saves nothing when consecutive keys have no prefix in common.
For instance, one key might be RowKey:Family:Qualifier0 and the next key might be RowKey:Family:Qualifier1. Assuming the first key here is totally different from the key before it, its prefix length is 0. The second key's prefix length is 23, since the two keys have their first 23 characters in common.
| Key Len | Prefix Len | Val Len | Key | Value |
| 24 | 0 | 3 | RowKey:Family:Qualifier0 | abc |
| 1 | 23 | | 1 | |
| 1 | 23 | | N | |
| 19 | 6 | | 1:Family:Qualifier0 | |
| 1 | 24 | | 1 | |
| 1 | 24 | | 2 | |
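The prefix-length bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not HBase's actual encoder: the function names `common_prefix_len`, `prefix_encode` and `prefix_decode` are made up for this example, and keys are treated as plain strings rather than byte arrays.

```python
from typing import List, Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(keys: List[str]) -> List[Tuple[int, int, str]]:
    """Encode keys as (key_len, prefix_len, suffix) rows.

    key_len is the length of the suffix that is actually stored.
    """
    rows = []
    previous = ""  # the first key has nothing to share a prefix with
    for key in keys:
        shared = common_prefix_len(previous, key)
        suffix = key[shared:]
        rows.append((len(suffix), shared, suffix))
        previous = key
    return rows


def prefix_decode(rows: List[Tuple[int, int, str]]) -> List[str]:
    """Rebuild the full keys by reusing the prefix of the previous key."""
    keys = []
    previous = ""
    for _key_len, shared, suffix in rows:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


keys = [
    "RowKey:Family:Qualifier0",
    "RowKey:Family:Qualifier1",
    "RowKey:Family:QualifierN",
    "RowKey1:Family:Qualifier0",
    "RowKey1:Family:Qualifier1",
    "RowKey1:Family:Qualifier2",
]
encoded = prefix_encode(keys)
for row in encoded:
    print(row)
```

Run on the six sample keys, the sketch yields the same Key Len and Prefix Len values as the table, and `prefix_decode(encoded)` returns the original keys. Note that decoding a key requires the previous key, which is why encoded blocks are read sequentially.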
Diff Data Block Encoding –
Diff encoding expands upon Prefix encoding. Instead of treating the key as a monolithic sequence of bytes, it splits each key into its fields so that each part can be compressed more efficiently.
Two new fields are added:
· Timestamp
· Type
If the ColumnFamily is the same as the previous row, it is omitted from the current row.
If the key length, value length or type is the same as the previous row, the field is omitted.
In addition, for increased compression, the timestamp is stored as a diff from the previous row’s timestamp, rather than being stored in full. Given the two row keys in the Prefix example, and given an exact match on timestamp and the same type, neither the value length nor the type needs to be stored for the second row, and the timestamp value for the second row is just 0, rather than a full timestamp.
Diff encoding is disabled by default because writing and scanning are slower, although more data fits in the cache.
| Flags | Key Len | Prefix Len | Val Len | Key | Timestamp | Type | Value |
| 0 | 24 | 0 | 512 | RowKey:Family:Qualifier0 | 130466835 | 4 | abc |
| 5 | | 23 | 320 | 1 | 0 | | |
| 3 | | 23 | | N | 120 | 8 | |
| 0 | 25 | 6 | 576 | 1:Family:Qualifier0 | 25 | 4 | |
| 5 | | 24 | 384 | 1 | 1124 | | |
| | | 24 | | 2 | | | |
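The two Diff rules, omitting fields that repeat and storing the timestamp as a difference, can be sketched as follows. This is a simplified illustration under stated assumptions, not HBase's on-disk format: the names `diff_encode` and `diff_decode`, the dictionary layout, and the sample timestamps are all hypothetical, and `None` stands in for "field omitted" where the real encoder would use flag bits.

```python
from typing import Dict, List, Optional

Cell = Dict[str, int]
EncodedRow = Dict[str, Optional[int]]


def _omit_if_same(current: int, previous: int) -> Optional[int]:
    """Return None (field omitted) when the value repeats the previous row."""
    return None if current == previous else current


def diff_encode(cells: List[Cell]) -> List[EncodedRow]:
    """Drop repeated val_len/type fields and store timestamps as deltas."""
    encoded: List[EncodedRow] = []
    previous: Optional[Cell] = None
    for cell in cells:
        if previous is None:
            # First row: nothing to compare against, so store everything.
            row: EncodedRow = dict(cell)
        else:
            row = {
                "val_len": _omit_if_same(cell["val_len"], previous["val_len"]),
                "type": _omit_if_same(cell["type"], previous["type"]),
                "timestamp": cell["timestamp"] - previous["timestamp"],
            }
        encoded.append(row)
        previous = cell
    return encoded


def diff_decode(rows: List[EncodedRow]) -> List[Cell]:
    """Rebuild full cells by filling omitted fields from the previous cell."""
    cells: List[Cell] = []
    previous: Optional[Cell] = None
    for row in rows:
        if previous is None:
            cell = {k: v for k, v in row.items() if v is not None}
        else:
            cell = {
                "val_len": previous["val_len"] if row["val_len"] is None else row["val_len"],
                "type": previous["type"] if row["type"] is None else row["type"],
                "timestamp": previous["timestamp"] + row["timestamp"],
            }
        cells.append(cell)
        previous = cell
    return cells


# Hypothetical sample cells: the second repeats the first exactly.
cells = [
    {"val_len": 3, "type": 4, "timestamp": 1304668350},
    {"val_len": 3, "type": 4, "timestamp": 1304668350},
    {"val_len": 5, "type": 4, "timestamp": 1304668470},
]
encoded_cells = diff_encode(cells)
for row in encoded_cells:
    print(row)
```

For the second cell, which matches the first on value length, type and timestamp, the sketch stores only a timestamp delta of 0, as the text describes. The third cell stores its new value length and a delta of 120.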
Fast Diff
Fast Diff works similarly to Diff, but uses a faster implementation. It also adds another field which stores a single bit to track whether the data itself is the same as the previous row. If it is, the data is not stored again.
Fast Diff is the recommended codec to use if you have long keys or many columns.
The data format is nearly identical to Diff encoding, so no separate illustration is included.
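The "same as previous" bit can be sketched like this. It is a minimal illustration of the idea only: `encode_values` and `decode_values` are hypothetical names, and a Python `bool` stands in for the single flag bit.

```python
from typing import List, Optional, Tuple

EncodedValue = Tuple[bool, Optional[bytes]]


def encode_values(values: List[bytes]) -> List[EncodedValue]:
    """Pair each value with a flag; skip the payload when it repeats."""
    rows: List[EncodedValue] = []
    previous: Optional[bytes] = None
    for value in values:
        same = previous is not None and value == previous
        # When the flag is set, the value itself is not stored again.
        rows.append((same, None if same else value))
        previous = value
    return rows


def decode_values(rows: List[EncodedValue]) -> List[bytes]:
    """Restore the values, copying the previous one when the flag is set."""
    values: List[bytes] = []
    for same, payload in rows:
        if same:
            values.append(values[-1])
        else:
            assert payload is not None
            values.append(payload)
    return values


sample_values = [b"abc", b"abc", b"abd", b"abd", b"abc"]
encoded_values = encode_values(sample_values)
for row in encoded_values:
    print(row)
```

Only consecutive repeats are skipped. In the sample, the final `b"abc"` is stored in full because the row before it holds `b"abd"`.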
Prefix Tree
Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It provides similar memory savings to the Prefix, Diff, and Fast Diff encoders, but provides faster random access at the cost of slower encoding speed.
Prefix Tree may be appropriate for applications that have high block cache hit ratios. It introduces new 'tree' fields for the row and column. The row tree field contains a list of offsets/references corresponding to the cells in that row. This allows for a good deal of compression. For more details about Prefix Tree encoding, see HBASE-4676.
It is difficult to graphically illustrate a prefix tree, so no image is included. See the Wikipedia article on Trie for more general information about this data structure.
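As a rough aid to intuition, here is a generic trie over row keys in which each row's final node keeps a list of cell offsets. It is a sketch of the general data structure only and makes no claim about the layout HBase uses (see HBASE-4676 for that). `TrieNode`, `build_row_trie`, `lookup` and `count_nodes` are names invented for this example.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One character position in the trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Offsets of the cells whose row key ends at this node.
        self.cell_offsets: List[int] = []


def build_row_trie(row_keys: List[str]) -> TrieNode:
    """Insert each cell's row key; shared prefixes reuse the same nodes."""
    root = TrieNode()
    for offset, row in enumerate(row_keys):
        node = root
        for ch in row:
            node = node.children.setdefault(ch, TrieNode())
        node.cell_offsets.append(offset)
    return root


def lookup(root: TrieNode, row: str) -> List[int]:
    """Walk the trie and return the cell offsets stored for this row."""
    node: Optional[TrieNode] = root
    for ch in row:
        node = node.children.get(ch) if node is not None else None
        if node is None:
            return []
    return node.cell_offsets if node is not None else []


def count_nodes(node: TrieNode) -> int:
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Row portion of the six sample cells, in block order.
rows = ["RowKey", "RowKey", "RowKey", "RowKey1", "RowKey1", "RowKey1"]
trie = build_row_trie(rows)
print(lookup(trie, "RowKey"))
print(lookup(trie, "RowKey1"))
print(count_nodes(trie))
```

The six row keys contain 39 characters in total, but the trie needs only 8 nodes (the root plus one per character of "RowKey1"). A lookup walks from the root without scanning the preceding cells, which is the random-access advantage the text refers to.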