Vasu

Reputation: 4982

Pig load files using tuple's field

I need help with the following use case:

Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:

(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)

So basically, each tuple has a file path as the value of one of its fields (we can obviously transform this into tuples having only one field whose value is the file path, or into a single tuple with one field holding a delimited (say, comma-separated) string of paths).

So now I have to load these files in my Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the advanced foreach operator and tried as follows:

data = foreach tuples_with_file_info {
    fileData = load $2 using PigStorage(',');
    ....
    ....
};

However, it's not working.

Edit: For simplicity, let's assume I have a single tuple with one field containing the file name:

(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)

Upvotes: 0

Views: 995

Answers (2)

Vasu

Reputation: 4982

First, store the tuples_with_file_info into some file:

STORE tuples_with_file_info INTO 'some_temporary_file';

then,

data = LOAD 'some_temporary_file' using MyCustomLoader();

where MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as its InputFormat.

MyInputFormat is a wrapper around the actual InputFormat (e.g. TextInputFormat) that has to be used to read the actual data from the files (in my case, from the file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).

In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you have to get this file's path from the Configuration's mapred.input.dir property), then update the same mapred.input.dir property in the Configuration with the retrieved file names, and finally return the result from the wrapped InputFormat (in my case, TextInputFormat).
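The path-rewriting at the heart of that getSplits override is plain string work. Here is a minimal sketch of just that step, in Python for brevity (the real override is Java against the Hadoop API); the tab delimiter and the field index are my assumptions, based on what a default STORE ... USING PigStorage() writes:

```python
def rewrite_input_dir(stored_tuple_lines, path_field_index=1):
    r"""Extract the file-path field from each stored tuple line and join
    the paths into the comma-separated form that mapred.input.dir expects.

    Lines are assumed tab-separated, as written by a default
    STORE ... USING PigStorage('\t'); path_field_index is an assumption,
    adjust it to your tuple layout.
    """
    paths = [line.rstrip("\n").split("\t")[path_field_index]
             for line in stored_tuple_lines]
    return ",".join(paths)
```

In the actual getSplits you would read these lines from the path found in mapred.input.dir, set the rewritten value back on the Configuration, and then delegate to the wrapped TextInputFormat.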

Note:

1. You cannot use the setLocation API of LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.

2. A doubt may arise: what if the LOAD statement executes before the STORE? This will not happen: if STORE and LOAD use the same file in the script, Pig ensures that the jobs are executed in the right sequence. For more detail you may read the section "Store-load sequences" on the Pig Wiki.

Upvotes: 0

SNeumann

Reputation: 1177

You can't use Pig out of the box to do it.

What I would do is use some other scripting language (bash, Python, Ruby...) to read the file from HDFS and concatenate the file names into a single string that you can then pass as a parameter to a Pig script to use in your LOAD statement. Pig supports globbing, so you can do the following:

a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...

so all that's left to do is read the file that contains those file names and concatenate them into a glob such as:

{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}

and pass that as a parameter to Pig so your script would start with:

a = LOAD '$input'

and your pig call would look like this:

pig -f script.pig -param "input={hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}"

(Quote the parameter: an unquoted {a,b} would be brace-expanded by bash before Pig ever sees it.)
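Tying those steps together, here is a small driver sketch in Python (one of the scripting languages suggested above); the one-path-per-line file layout and the helper names are my assumptions:

```python
def build_glob(lines):
    """Join one-HDFS-path-per-line input into a Pig glob: {path1,path2}."""
    paths = [line.strip() for line in lines if line.strip()]
    return "{" + ",".join(paths) + "}"

def pig_command(script, glob):
    """Build the pig invocation as an argument list. Running it via
    subprocess with a list (not a shell string) also sidesteps shell
    brace expansion of the glob."""
    return ["pig", "-f", script, "-param", "input=" + glob]

# Usage sketch (assumes file_list.txt holds one HDFS path per line,
# e.g. fetched with `hadoop fs -cat`, and that pig is on the PATH):
# import subprocess
# with open("file_list.txt") as f:
#     subprocess.call(pig_command("script.pig", build_glob(f)))
```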

Upvotes: 1
