HBase Performance - 3 LZO and Snappy Compression

LZO Compression
LZO is a block compression algorithm: it splits the file into blocks, compresses each block individually, and writes the blocks to HDFS. On read, the blocks do not have to be merged in one place first. Each block can be read at the node level, decompressed, and fetched.
   LZ4: a newer algorithm from the same Lempel-Ziv family, optimized for speed at the cost of compression ratio
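
The block-wise behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard-library zlib module stands in for LZO (which is not part of the standard library), and the helper names compress_blocks and read_block, as well as the 64-byte block size, are hypothetical.

```python
import zlib
from typing import List

BLOCK_SIZE = 64  # hypothetical block size, kept tiny so the example is readable


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks and compress each one independently."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_block(blocks: List[bytes], index: int) -> bytes:
    """Decompress a single block without touching any of the others."""
    return zlib.decompress(blocks[index])


# 20 records of 15 bytes each (zlib is an assumed stand-in for LZO here).
data = b"".join(b"row-%04d|value\n" % i for i in range(20))
blocks = compress_blocks(data)

# Only the third block is decompressed; the rest stay compressed.
third = read_block(blocks, 2)
```

Because every block is a complete compressed unit, a reader that holds only one block can decode it on its own, which is the property the section above relies on.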


Snappy Compression 
Snappy is a project developed by Google (https://github.com/google/snappy). It belongs to the Lempel-Ziv family of compression algorithms, as does LZO.
When Snappy is applied directly to a file, it takes the whole file, compresses it as a single stream, and loads the result into blocks. The drawback shows up on read: the data can only be decompressed at the file level. A node cannot decompress just the block it holds. Instead, all blocks have to be brought to one place and read from the file header to the end of the file. The reason is that a file compressed this way is not splittable: Snappy compresses it as is before it is loaded into HDFS, so a reader has no point inside the file from which it can start.
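
The whole-file behavior described above can be illustrated with a short sketch. It assumes zlib from the Python standard library as a stand-in for Snappy (the two codecs differ in format and speed, but both produce a single stream when applied to a whole file); the helper name read_from is hypothetical.

```python
import zlib

data = b"".join(b"row-%04d|value\n" % i for i in range(2000))
stream = zlib.compress(data)  # the whole input becomes one compressed stream


def read_from(offset: int) -> bytes:
    """Try to start decompressing at an arbitrary byte offset of the stream."""
    return zlib.decompress(stream[offset:])


# Reading from the start of the stream recovers the data.
from_start = read_from(0)

# Starting in the middle fails: the reader has no header and no history
# of the bytes that came before, so it cannot make sense of the stream.
try:
    read_from(len(stream) // 2)
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False
```

A task that is handed only the second half of such a file is in the same position as the failed call: it has to go back to the start of the file before it can read anything.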



Benefits of Snappy Over LZO Compression  
In Context of Apache Hadoop

  • Snappy is faster than LZO at decompression and comparable at compression, so over a full round trip Snappy comes out ahead of LZO.
  • Snappy is released under a BSD license, so it can be shipped with Hadoop. LZO is under the GPL, so it has to be downloaded and installed separately. (The Cloudera installation of HBase includes Snappy.)
Many Hadoop clusters use LZO compression for intermediate MapReduce output. This output is not used by users directly; it is always written to disk by the mapper and then accessed across the network by the reducer. Map output is a prime candidate for compression since it tends to be compressible (there is some redundancy in the key space, since the map outputs are sorted), and because writing to disk is slow it pays to perform some light compression to reduce the number of bytes written (and later read). Snappy and LZO are not CPU intensive, which is important, as other map and reduce processes running at the same time will not be deprived of CPU time. In testing, we have seen that the performance of Snappy is generally comparable to LZO, with up to a 20% improvement in overall job time in some cases.
This use alone justifies installing Snappy, but there are other places Snappy can be used within Hadoop applications. For example, Snappy can be used for block compression in all the commonly-used Hadoop file formats, including Sequence Files, Avro Data Files, and HBase tables.
One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce. This is different from LZO, where it is possible to index LZO-compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.
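
The idea of recording split points can be sketched as follows. This is a simplified illustration, not the actual on-disk layout of indexed LZO files, Sequence Files, or Avro Data Files: zlib from the Python standard library stands in for the codec, and build_indexed_file, read_split, and the choice of 10 records per block are hypothetical.

```python
import zlib
from typing import List, Tuple


def build_indexed_file(records: List[bytes], per_block: int) -> Tuple[bytes, List[int]]:
    """Compress records in independent blocks. Return the file bytes and the
    byte offset at which each compressed block starts (the index)."""
    body = bytearray()
    offsets = []
    for start in range(0, len(records), per_block):
        offsets.append(len(body))
        body += zlib.compress(b"".join(records[start:start + per_block]))
    return bytes(body), offsets


def read_split(body: bytes, offsets: List[int], lo: int, hi: int) -> bytes:
    """Decode every block whose start offset falls in [lo, hi), the way one
    parallel task would process only its own split."""
    out = bytearray()
    ends = offsets[1:] + [len(body)]
    for begin, end in zip(offsets, ends):
        if lo <= begin < hi:
            out += zlib.decompress(body[begin:end])
    return bytes(out)


records = [b"row-%04d|value\n" % i for i in range(100)]
body, offsets = build_indexed_file(records, per_block=10)

# Two tasks each take half of the file's byte range and work independently.
middle = len(body) // 2
first_half = read_split(body, offsets, 0, middle)
second_half = read_split(body, offsets, middle, len(body))
```

Each task only needs the index and its own byte range, so the two halves can be decoded in parallel and still add up to the full set of records.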

How to use Snappy with Hadoop

Snappy support was added to Hadoop in HADOOP-7206, which will be available in the forthcoming 0.23.0 Apache release. Enabling map output compression is as simple as adding the following to mapred-site.xml:
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
