Sunday, November 21, 2010

Where to get Hadoop training

We keep hearing how every company is trying to move to Hadoop to analyze big data, but even today Hadoop training options are very limited.

At this time Cloudera is the only company that offers Hadoop training. You can get more details at www.cloudera.com/hadoop-training. We took this training and found it a useful introduction to Hadoop.

It walks you through:
- all the core concepts of Hadoop
- the ecosystem of tools being built around Hadoop, like Hive, ZooKeeper, HBase, etc.
- writing some Hive queries
- writing a MapReduce program (a minimal sketch of that kind of program follows this list)
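To give a feel for the last item, here is a minimal word-count style MapReduce job, roughly the kind of program the training has you write. The class, input, and output names here are my own illustration, not the exact exercise from the course.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative job: count how often each word appears in the input files.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // All counts for one word arrive at the same reducer; add them up.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```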

The training is good, but it definitely has a lot of room for improvement. The top 3 concepts I would have liked the training to cover are:
- A real-world example of how sorting of keys is a very powerful feature of Hadoop
- A real-world example of how to use the Partitioner class (a minimal sketch follows this list)
- How to handle data skews in Hadoop
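Since the Partitioner is one of the concepts I keep coming back to, here is a minimal sketch of what a custom Partitioner looks like. The composite key format ("customerId|timestamp") and the class name are my own illustration; the idea is that you partition on part of the key so related records land on the same reducer, while the full keys still arrive at that reducer in sorted order.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: keys look like "customerId|timestamp".
public class CustomerPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Partition only on the customer id part of the composite key, so every
    // record for one customer goes to the same reducer. Within that reducer
    // the keys still arrive sorted, i.e. the customer's records come in
    // timestamp order.
    String customerId = key.toString().split("\\|")[0];
    return (customerId.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```

It gets wired into the job with job.setPartitionerClass(CustomerPartitioner.class).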

From my own experience, a grasp of these 3 concepts means you can start using Hadoop in real-world production. If you don't understand any of them, you will find yourself very limited in the world of Hadoop.

2 comments:

  1. Hi,
    I have some basic knowledge of Hadoop. I understood the first concept and have some idea about the second, but I don't know at all what you mean by the third... could you blog about it or reply to me?

    ReplyDelete
  2. The 3rd concept is data skew. It means the input data is not evenly distributed, so some of your reducers get a lot of work while others get very little. The total time to execute the job then depends on the slowest reducer (the one that got the biggest data set), and if the skew grows with the data size it is also possible that your job works for small data sets (like 1 hour of data) but fails for big data sets (like the same job run on 1 month of data).

    Let me explain with an example.
    Suppose you have a file with 300 million rows. The file has 2 columns, FTPUserName and FTPUserIP. 299 million rows have FTPUserName=anonymous and the remaining 1 million FTPUserNames are distinct. You are asked to find the distinct list of IP addresses used by each FTPUser.

    The first instinct (sketched below) is to
    - write a mapper that emits FTPUserName as the key and FTPUserIP as the value
    - write a reducer that counts the distinct IPs for each FTPUserName
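    A rough sketch of that first-instinct job, assuming a tab-separated input of FTPUserName and FTPUserIP (the column layout and class names are my assumptions, not from the original data set):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctIpPerUser {

  public static class UserIpMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] cols = value.toString().split("\t");
      if (cols.length == 2) {
        // key = FTPUserName, value = FTPUserIP
        context.write(new Text(cols[0]), new Text(cols[1]));
      }
    }
  }

  public static class DistinctIpReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // All 299 million "anonymous" values arrive in this single call;
      // this is exactly where the skew hurts.
      Set<String> distinctIps = new HashSet<String>();
      for (Text ip : values) {
        distinctIps.add(ip.toString());
      }
      context.write(key, new IntWritable(distinctIps.size()));
    }
  }
}
```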

    In this case the rows for the 1 million distinct users will be spread evenly across all available reducers (which is the desired behavior), but the 299 million anonymous rows will all go to just one reducer and tremendously slow down the whole job (in my case it failed the job, because the amount of data sent to that one reducer exceeded the temporary disk space assigned to a reducer).
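    For what it's worth, one common way to break up a hot key like "anonymous" (my own sketch, not something covered above) is to salt the key in the mapper so its rows spread across many reducers, then combine the per-salt results in a much smaller second job. Salting on a hash of the IP keeps each distinct IP under exactly one salted key, so the second job only needs to strip the salt suffix and sum the counts per user:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: spreads the "anonymous" hot key over NUM_SALTS reducers.
public class SaltedUserIpMapper extends Mapper<LongWritable, Text, Text, Text> {

  private static final int NUM_SALTS = 50; // assumption: tune to your reducer count

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split("\t");
    if (cols.length == 2) {
      // Hashing the IP picks one salt per distinct IP, so "anonymous" becomes
      // "anonymous#0" .. "anonymous#49" and its rows are spread over many
      // reducers, while each distinct IP is still counted exactly once.
      int salt = (cols[1].hashCode() & Integer.MAX_VALUE) % NUM_SALTS;
      context.write(new Text(cols[0] + "#" + salt), new Text(cols[1]));
    }
  }
}
```

    The distinct-IP reducer stays the same; the follow-up job just adds up the per-salt distinct counts for each user.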

    Hope this explains it.

    ReplyDelete