Tuesday, November 16, 2010

How to optimize Hadoop jobs: best practices and rules of thumb

Hadoop provides many ways to optimize jobs so they run faster on your data, but the following rules of thumb are worth checking before spending more time on deeper tuning.

- Number of mappers
How long are your mappers running for? If they only run for a few seconds on average, see whether you can use fewer mappers and make each one run longer, a minute or so as a rule of thumb. How far you can take this depends on the input format your job uses.
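As a back-of-the-envelope check (a plain-Java sketch of the idea, not Hadoop's actual split logic), the number of map tasks for a file-based input format is roughly the input size divided by the split size, so raising the minimum split size yields fewer, longer-running mappers:

```java
// Rough estimate of map-task count for a given split size. This mirrors the
// idea behind FileInputFormat splitting, not its exact implementation.
public class SplitEstimate {
    static long estimateMappers(long totalInputBytes, long splitSizeBytes) {
        // Ceiling division: every partial split still gets its own mapper.
        return (totalInputBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1024 * 1024 * 1024;
        long block64Mb = 64L * 1024 * 1024;   // common HDFS block size at the time
        long split512Mb = 512L * 1024 * 1024; // larger split -> fewer, longer mappers
        System.out.println(estimateMappers(tenGb, block64Mb));  // 160 mappers
        System.out.println(estimateMappers(tenGb, split512Mb)); // 20 mappers
    }
}
```

With a 10 GB input, going from 64 MB to 512 MB splits drops the job from 160 short mappers to 20 longer ones.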

- Number of reducers
For maximum performance, the number of reducers should be slightly less than the number of reduce slots in the cluster. This allows the reducers to finish in one wave, and fully utilizes the cluster during the reduce phase.
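A commonly cited heuristic (a sketch of the rule of thumb above, not a Hadoop API) is to set the reducer count to roughly 0.95 times the cluster's total reduce slots, leaving a little headroom so a slow or failed task does not push the job into a second wave:

```java
// Pick a reducer count just under the cluster's reduce-slot capacity so the
// reduce phase finishes in a single wave. The 0.95 factor is a heuristic.
public class ReducerCount {
    static int oneWaveReducers(int nodes, int reduceSlotsPerNode) {
        return (int) Math.floor(0.95 * nodes * reduceSlotsPerNode);
    }

    public static void main(String[] args) {
        // e.g. 20 nodes with 2 reduce slots each -> 38 reducers
        System.out.println(oneWaveReducers(20, 2)); // 38
    }
}
```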

- Combiners
Can your job take advantage of a combiner to reduce the amount of data passing through the shuffle?
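To see the effect, here is a plain-Java simulation of what a word-count combiner does on a single mapper's output (this is not the Hadoop Reducer API, just the local pre-aggregation idea): partial counts are summed before the shuffle, so far fewer records leave the map side.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulates a word-count combiner: pre-summing (word, 1) records on the map
// side so fewer records cross the network during the shuffle.
public class CombinerDemo {
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partialSums = new LinkedHashMap<>();
        for (String word : mapOutputKeys) {
            partialSums.merge(word, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("a", "b", "a", "a", "c", "b");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size()); // 6 records emitted by the mapper
        System.out.println(combined.size());  // only 3 records shuffled
    }
}
```

A combiner only works when the reduce function is commutative and associative (sums and counts qualify; averages, as written naively, do not).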

- Intermediate compression
Job execution time can almost always benefit from enabling map output compression.
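A sketch of turning this on with the old-style JobConf API (property names here are the pre-0.21 ones; later releases renamed them, and `MyJob` is a placeholder for your job class):

```java
// Config sketch: enable intermediate (map-output) compression.
JobConf conf = new JobConf(MyJob.class);  // MyJob is a placeholder
conf.setCompressMapOutput(true);          // mapred.compress.map.output
conf.setMapOutputCompressorClass(         // mapred.map.output.compression.codec
    org.apache.hadoop.io.compress.DefaultCodec.class);
```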

- Custom serialization
If you are using your own custom Writable objects or custom comparators, then make sure you have implemented RawComparator.
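The point of RawComparator is that keys are compared in their serialized byte form, skipping deserialization entirely during the sort. Here is a plain-Java illustration of that idea with no Hadoop dependency, comparing two big-endian serialized ints byte by byte (this sketch assumes non-negative keys, since only then does big-endian byte order match numeric order):

```java
import java.nio.ByteBuffer;

// Illustrates the RawComparator idea without Hadoop: compare two serialized
// big-endian ints directly as bytes, never materializing the int values.
public class RawIntCompare {
    static int compareSerialized(byte[] b1, byte[] b2) {
        for (int i = 0; i < 4; i++) {
            // Compare as unsigned bytes, most significant byte first.
            int cmp = (b1[i] & 0xff) - (b2[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] k1 = ByteBuffer.allocate(4).putInt(42).array();
        byte[] k2 = ByteBuffer.allocate(4).putInt(1000).array();
        System.out.println(compareSerialized(k1, k2) < 0); // true: 42 < 1000
    }
}
```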

- Shuffle tweaks
The MapReduce shuffle exposes around a dozen tuning parameters for memory management, which may help you eke out the last bit of performance.
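A sketch of a few of the era's shuffle-tuning knobs, again with pre-0.21 property names and `MyJob` as a placeholder; the right values depend entirely on your cluster and job:

```java
// Config sketch: common shuffle-tuning properties (defaults in comments).
JobConf conf = new JobConf(MyJob.class);          // MyJob is a placeholder
conf.setInt("io.sort.mb", 200);                   // map-side sort buffer in MB (default 100)
conf.setInt("io.sort.factor", 50);                // streams merged at once (default 10)
conf.setInt("mapred.reduce.parallel.copies", 10); // parallel map-output fetches (default 5)
```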
