Thursday, November 18, 2010

How to use distcp in Hadoop


DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.

The following example runs distcp from the hadoop client:
hadoop distcp hdfs://SourceCluster:9000/foo/bar hdfs://DestinationCluster:9000/bar/foo

You can get the values hdfs://SourceCluster:9000 and hdfs://DestinationCluster:9000 from the fs.default.name property in the core-site.xml file of each cluster's Hadoop installation.
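If you would rather not open the file by hand, the lookup can be scripted. The following is only a rough sketch: get_default_fs is a hypothetical helper name, and it assumes the <name> and <value> elements of the property are laid out one per line, as in a typical hand-edited core-site.xml. It is not a real XML parser.

```shell
# Hypothetical helper: print the value of fs.default.name from a core-site.xml file.
# Assumption: the property's <value> appears on or after its <name> line.
get_default_fs() {
  conf_file="$1"
  awk '
    # Remember that we have seen the property name.
    /<name>fs\.default\.name<\/name>/ { found = 1 }
    # Print the first <value> on or after that line, stripped of its tags.
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")
      print
      exit
    }
  ' "$conf_file"
}
```

You would call it as get_default_fs /path/to/core-site.xml, where the path depends on your installation (often the conf directory under your Hadoop home, but that is an assumption about your setup).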

The following example copies multiple directories:
hadoop distcp hdfs://SourceCluster:9000/foo/a hdfs://SourceCluster:9000/foo/b hdfs://DestinationCluster:9000/bar/foo
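When the list of source directories grows, typing the namenode prefix for each one gets error-prone. The sketch below assembles the same kind of command line from a list of paths. build_distcp_cmd is a hypothetical helper, not part of Hadoop, and it only prints the command so you can check it before running it. It assumes the paths contain no spaces.

```shell
# Hypothetical helper: print a distcp command line for several source paths.
# Arguments: source namenode URI, destination URI, then one or more source paths.
build_distcp_cmd() {
  src_fs="$1"
  dest="$2"
  shift 2
  cmd="hadoop distcp"
  # Prefix every source path with the source namenode URI.
  for path in "$@"; do
    cmd="$cmd $src_fs$path"
  done
  # The destination always comes last.
  printf '%s\n' "$cmd $dest"
}

# Prints the multi-directory command used in the example above.
build_distcp_cmd hdfs://SourceCluster:9000 hdfs://DestinationCluster:9000/bar/foo /foo/a /foo/b
```

Once the printed command looks right, you can paste it into the shell to run the copy.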
