Reputation: 600
I am writing sanity tests for CDH 5.3 cluster installs. One test case creates a Hive table over a directory with 1000 partitions and then queries random partitions. Originally this was done with a series of for loops and took hours. The directories are created with:
hadoop fs -mkdir -p /hdfs/directory/partition{1..1000}
then:
hadoop fs -put /path/to/local/file /hdfs/directory/partitionX
Passing one local file with multiple destination directories just throws an error, and a for loop over the directories takes hours to complete. -copyFromLocal throws a similar error to -put. Putting the file into the first directory and then copying it to the rest with a for loop also takes quite a bit of time.
Any ideas on how to copy one file to many directories in the quickest and most efficient way possible?
Upvotes: 2
Views: 2198
Reputation: 1074
In order to speed up the copy, some kind of parallelism is needed. It would be easy to write a multi-threaded Java program that submits dozens of HDFS copy commands at a time.
With shell script, you may do something like:
m=10
for (( i = 0; i < 100; i++ )); do
    for (( j = 1; j <= m; j++ )); do
        sh hdfs_cp_script partition$(( i*m + j )) &
    done
    wait
done
This submits ten copy commands at a time, in the background, over a loop of 100 batches; wait blocks until each batch of ten finishes before starting the next.
Upvotes: 1
Reputation: 2725
A faster way to achieve this would be to write a Java application that uses the Hadoop FileSystem API to write the file into the various HDFS directories.
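Neither answer includes code for the Java route, so here is a minimal sketch of the fan-out pattern: a fixed-size thread pool copies one source file into many destination directories in parallel. To keep the sketch self-contained and runnable without a cluster, it uses `java.nio.file.Files.copy` on the local filesystem as a stand-in for the Hadoop call; on a real cluster you would obtain a `FileSystem` from the cluster configuration and replace the copy line with `FileSystem.copyFromLocalFile`, leaving the surrounding structure unchanged. The thread count, partition count, and path names are illustrative assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCopy {
    public static void main(String[] args) throws Exception {
        // Local stand-ins for /path/to/local/file and /hdfs/directory.
        Path src = Files.createTempFile("sample", ".dat");
        Files.write(src, "sample data".getBytes());
        Path root = Files.createTempDirectory("partitions");

        int partitions = 20;  // 1000 on the real cluster
        ExecutorService pool = Executors.newFixedThreadPool(10);
        List<Future<?>> futures = new ArrayList<>();

        for (int i = 1; i <= partitions; i++) {
            final int n = i;
            futures.add(pool.submit(() -> {
                try {
                    Path dir = Files.createDirectories(root.resolve("partition" + n));
                    // On HDFS this line becomes:
                    //   fs.copyFromLocalFile(localSrc, new Path(dir, fileName));
                    Files.copy(src, dir.resolve(src.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }));
        }

        int ok = 0;
        for (Future<?> f : futures) {
            f.get();  // propagate any copy failure
            ok++;
        }
        pool.shutdown();
        System.out.println("copied " + ok + "/" + partitions);
    }
}
```

Since HDFS copies are I/O-bound, a pool of 10–20 threads usually saturates the NameNode/DataNode pipeline; going much wider rarely helps.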
Upvotes: 1