Read more about »
  • Java 9 features
  • Read about Hadoop
  • Read about Storm
  • Read about Storm

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.

What to compress?

1. Compressing input files

If the input file is compressed, then the bytes read in from HDFS is reduced, which means less time to read data. This time conservation is beneficial to the performance of job execution.

If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. For example, a file ending in .gz can be identified as gzip-compressed file and thus read with GzipCodec.

2. Compressing output files

Often we need to store the output as history files. If the amount of output per day is extensive, and we often need to store history results for future use, then these accumulated results will take extensive amount of HDFS space. However, these history files may not be used very frequently, resulting in a waste of HDFS space. Therefore, it is necessary to compress the output before storing on HDFS.

3. Compressing map output

Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, by using a fast compressor such as LZO or Snappy, you can get performance gains simply because the volume of data to transfer is reduced.

Hadoop Codecs

There are many different compression formats, tools, and algorithms, each with different characteristics. All compression algorithms exhibit a pace/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings.

How much the size of your data decreases will depend on the chosen compression algorithm. If query performance is your top priority, compression may or may not make sense as the compression overhead can outweigh the benefits of reduced I/O. We recommend that you validate the impact of the various compression algorithms on your data before you decide whether to compress or not.

A codec, which is a shortened form of compressor/decompressor, is technology (software or hardware, or both) for compressing and decompressing data; it’s the implementation of a compression/decompression algorithm. You need to know that some codecs support something called splittable compression and that codecs differ in both the speed with which they can compress and decompress data and the degree to which they can compress it.

Splittable compression is an important concept in a Hadoop context. The way Hadoop works is that files are split if they’re larger than the file’s block size setting, and individual file splits can be processed in parallel by different mappers.

With most codecs, text file splits cannot be decompressed independently of other splits from the same file, so those codecs are said to be non-splittable, so MapReduce processing is limited to a single mapper. Because the file can be decompressed only as a whole, and not as individual parts based on splits, there can be no parallel processing of such a file, and performance might take a huge hit as a job waits for a single mapper to process multiple data blocks that can’t be decompressed independently.

Codec File Extension Splittable? Degree of Compression Compression Speed
Gzip .gz No Medium Medium
Bzip2 .bz2 Yes High Slow
Snappy .snappy No Medium Fast
LZO .lzo No, unless indexed Medium Fast

General Guidelines

  • You need to balance the processing capacity required to compress and uncompress the data, the disk IO required to read and write the data, and the network bandwidth required to send the data across the network. The correct balance of these factors depends upon the characteristics of your cluster and your data, as well as your usage patterns.
  • Compression is not recommended if your data is already compressed (such as images in JPEG format). In fact, the resulting file can actually be larger than the original.
  • GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZip is often a good choice for cold data, which is accessed infrequently. Snappy or LZO are a better choice for hot data, which is accessed frequently.
  • BZip2 can also produce more compression than GZip for some types of files, at the cost of some speed when compressing and decompressing. HBase does not support BZip2 compression.
  • Snappy often performs better than LZO. It is worth running tests to see if you detect a significant difference.
  • For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data.
  • For MapReduce, you can compress either the intermediate data, the output, or both. Adjust the parameters you provide for the MapReduce job accordingly.