Reputation: 34693
I have an HDFS file listing CSV files that all share the same format. I need to be able to LOAD
them together with Pig. E.g.:
/path/to/files/2013/01-01/qwe123.csv
/path/to/files/2013/01-01/asd123.csv
/path/to/files/2013/01-01/zxc321.csv
/path/to/files/2013/01-02/ert435.csv
/path/to/files/2013/01-02/fgh987.csv
/path/to/files/2013/01-03/vbn764.csv
They cannot be globbed, as their names are "random" hashes and their directories might contain other CSV files.
Upvotes: 2
Views: 5016
Reputation: 4575
As suggested in other comments, you can do this by pre-processing the file. Suppose your HDFS file is called file_list.txt; then you can do the following:
pig -param flist=`hdfs dfs -cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}'` script.pig
The awk code gets rid of the newline characters and uses commas to separate the file names.
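For instance, assuming file_list.txt contains the three 01-01 paths from the question, the pipeline collapses them into a single comma-separated line (you can check this locally with a plain cat in place of hdfs dfs -cat):
cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}'
/path/to/files/2013/01-01/qwe123.csv,/path/to/files/2013/01-01/asd123.csv,/path/to/files/2013/01-01/zxc321.csv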
In your script (called script.pig in my example), you should use parameter substitution to load the data:
data = LOAD '$flist';
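Since the files are CSV, you will probably also want to give the loader a delimiter and, if you know it, a schema. A minimal sketch; the column names and types here are hypothetical, so adjust them to your data:
data = LOAD '$flist' USING PigStorage(',') AS (id:chararray, value:int); -- hypothetical schema
DUMP data; -- quick sanity check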
Upvotes: 1
Reputation: 5811
You aren't restricted to wildcard-only globs. Hadoop's glob syntax accepts a comma-separated list of alternatives inside braces, so you can enumerate the exact files:
data = LOAD '/path/to/files/2013/01-{01/qwe123,01/asd123,01/zxc321,02/ert435,02/fgh987,03/vbn764}.csv';
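If the brace pattern gets unwieldy, the loader also accepts a plain comma-separated list of full paths, which is the same behavior the $flist expansion in the other answer relies on. A sketch using the first three files from the question:
data = LOAD '/path/to/files/2013/01-01/qwe123.csv,/path/to/files/2013/01-01/asd123.csv,/path/to/files/2013/01-01/zxc321.csv';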
Upvotes: 1