ddinchev

Reputation: 34693

Load multiple files with PigLatin (Hadoop)

I have an HDFS file containing a list of CSV files, all with the same format. I need to be able to LOAD them together with Pig. E.g.:

/path/to/files/2013/01-01/qwe123.csv
/path/to/files/2013/01-01/asd123.csv
/path/to/files/2013/01-01/zxc321.csv
/path/to/files/2013/01-02/ert435.csv
/path/to/files/2013/01-02/fgh987.csv
/path/to/files/2013/01-03/vbn764.csv

They cannot be globbed with a wildcard, since their names are "random" hashes and their directories may contain other CSV files.

Upvotes: 2

Views: 5016

Answers (2)

cabad

Reputation: 4575

As suggested in other comments, you can do this by pre-processing the file list. Suppose your HDFS file is called file_list.txt; then you can do the following:

pig -param flist="$(hdfs dfs -cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}')" script.pig

The awk code gets rid of the newline characters and uses commas to separate the file names.
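For example, given the file list in the question, the command substitution expands to a single comma-separated string:

/path/to/files/2013/01-01/qwe123.csv,/path/to/files/2013/01-01/asd123.csv,/path/to/files/2013/01-01/zxc321.csv,/path/to/files/2013/01-02/ert435.csv,/path/to/files/2013/01-02/fgh987.csv,/path/to/files/2013/01-03/vbn764.csv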

In your script (called script.pig in my example), you should use parameter substitution to load the data:

data = LOAD '$flist';
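Since the files are CSV, you will probably also want to give the LOAD a delimiter and a schema. Here is a minimal sketch of script.pig; the field names and types (id, value) are placeholders for your actual columns:

-- load every file from the comma-separated list passed in via -param
data = LOAD '$flist' USING PigStorage(',') AS (id:chararray, value:int);
-- quick sanity check that all files were picked up
DUMP data;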

Upvotes: 1

reo katoa

Reputation: 5811

Globbing isn't actually ruled out here: Hadoop's glob syntax accepts comma-separated alternatives inside curly braces, and the alternatives can span directory boundaries, so you can enumerate the exact paths. In Pig the LOAD must also be assigned to a relation:

data = LOAD '/path/to/files/2013/01-{01/qwe123,01/asd123,01/zxc321,02/ert435,02/fgh987,03/vbn764}.csv';
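Note that the braces are expanded by Hadoop's glob matching, not by your shell, since the pattern sits inside the quoted path. As in the other answer, you can add a CSV loader and schema; a sketch with placeholder field names, plus a record count to verify that all six files were read:

data = LOAD '/path/to/files/2013/01-{01/qwe123,01/asd123,01/zxc321,02/ert435,02/fgh987,03/vbn764}.csv'
    USING PigStorage(',') AS (id:chararray, value:int);
-- count records across all matched files as a sanity check
grouped = GROUP data ALL;
counts = FOREACH grouped GENERATE COUNT(data);
DUMP counts;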

Upvotes: 1
