Chris Marotta

Reputation: 600

Put one local file into multiple HDFS directories

I am writing sanity tests for CDH 5.3 cluster installs. We have a test case that creates a Hive table on a directory with 1000 partitions and then queries random partitions. Originally this was done with a series of for loops, which took hours:

hadoop fs -mkdir -p /hdfs/directory/partition{1..1000}

then:

hadoop fs -put /path/to/local/file /hdfs/directory/partitionX

Passing one local file to multiple target directories with -put just throws an error, and -copyFromLocal throws a similar one; looping over the directories works but takes hours to complete. Putting the file into the first directory and then copying it to the others within HDFS in a for loop takes quite a bit of time as well.

Any ideas on how to copy one file to many directories in the quickest and most efficient way possible?

Upvotes: 2

Views: 2198

Answers (2)

Paul H.

Reputation: 1074

In order to speed up the copy, some kind of parallelism is needed. It would be easy to write a multi-threaded Java program that submits dozens of HDFS copy commands at a time.
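As a minimal sketch of that idea, the program below uses an ExecutorService to shell out ten hadoop fs -put commands at a time; the local file path, HDFS base directory, partition count, and pool size are placeholders taken from the question, and the hadoop binary is assumed to be on the PATH:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelHdfsPut {
    public static void main(String[] args) throws Exception {
        String localFile = "/path/to/local/file";     // placeholder from the question
        String baseDir = "/hdfs/directory/partition"; // placeholder from the question
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 1; i <= 1000; i++) {
            final String target = baseDir + i;
            pool.submit(() -> {
                try {
                    // each task runs one "hadoop fs -put" as a child process
                    new ProcessBuilder("hadoop", "fs", "-put", localFile, target)
                            .inheritIO().start().waitFor();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

Note that each hadoop fs invocation still starts its own JVM, so the parallelism here mostly hides that startup cost rather than eliminating it.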

With a shell script, you may do something like:

m=10
for (( i = 0; i < 100; i++ )); do
  for (( j = 1; j <= m; j++ )); do
    sh hdfs_cp_script partition$(( i * m + j )) &   # run each copy in the background
  done
  wait  # let this batch of m copies finish before starting the next
done

This submits ten commands at a time across 100 loop iterations, covering all 1000 partitions.

Upvotes: 1

Jeremy Beard

Reputation: 2725

A faster way to achieve this would be to write a Java application that uses the Hadoop FileSystem API to write the file into the various HDFS directories.
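A minimal sketch of that approach, assuming the Hadoop client libraries are on the classpath; the paths and partition count are placeholders mirroring the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultiDirPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/path/to/local/file"); // placeholder from the question
        for (int i = 1; i <= 1000; i++) {
            // one long-lived client connection serves every copy, avoiding
            // the per-command JVM startup of repeated "hadoop fs -put" calls
            fs.copyFromLocalFile(src, new Path("/hdfs/directory/partition" + i));
        }
        fs.close();
    }
}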

Upvotes: 1
