Wednesday, January 25, 2012

How to compress the intermediate output of Mappers and the final output of a MapReduce job in Hadoop

I was experimenting with different ways to improve the performance of Hadoop jobs, and one of the things I looked at was how compression helps.

There are two places where you can configure a Hadoop job to use compression:

1. Compress the intermediate output of the mapper
To do this for all jobs, you can set it cluster-wide in mapred-site.xml by adding the following properties:
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

I am compressing with GzipCodec here, but you have the option to use any of the following:
- GzipCodec
- DeflateCodec
- BZip2Codec
- SnappyCodec
Each of these has its strengths and weaknesses, so choose the trade-off you can live with. Also note that due to licensing differences, LZO does not ship with Hadoop; you can install it separately and use it if you'd like.

To enable it for just your job, set it on the Configuration object:
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.GzipCodec");


2. Compress the final output of the job
To save the final output in gzip, configure your MapReduce job with the following code:

job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
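
Putting the two together, a complete driver might look like the sketch below. It assumes the org.apache.hadoop.mapreduce API; MyMapper, MyReducer, and the input/output path arguments are placeholders you would replace with your own classes and paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 1. Compress the intermediate map output with gzip
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
                 "org.apache.hadoop.io.compress.GzipCodec");

        Job job = new Job(conf, "compressed-job");
        job.setJarByClass(CompressedJobDriver.class);
        job.setMapperClass(MyMapper.class);    // placeholder mapper class
        job.setReducerClass(MyReducer.class);  // placeholder reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        // 2. Compress the final output of the job with gzip
        TextOutputFormat.setCompressOutput(job, true);
        TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}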
