Reputation: 4982
I need help with the following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, each tuple has a file path as the value of one of its fields (we can obviously transform this into tuples having only a single field with the file path as its value, OR into a single tuple having one field with a delimiter-separated (say comma-separated) string of paths).
So now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed further? I thought of using the nested foreach operator and tried the following:
data = foreach tuples_with_file_info {
fileData = load $2 using PigStorage(',');
....
....
};
However, it's not working.
Edit: For simplicity, let's assume I have a single tuple with one field containing the file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
Upvotes: 0
Views: 995
Reputation: 4982
First, store tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then
data = LOAD 'some_temporary_file' USING MyCustomLoader();
where MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as its InputFormat.
MyInputFormat is a wrapper over the actual InputFormat (e.g. TextInputFormat) that has to be used to read the actual data from the files (e.g. in my case from the file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you get its path from the Configuration's mapred.input.dir property), then update that same mapred.input.dir property with the retrieved file names, and finally return the result from the wrapped InputFormat (e.g. in my case TextInputFormat).
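The indirection described above can be sketched, Hadoop-free, in plain Java. This is only an illustration of the principle, not actual InputFormat code: all class and method names below are hypothetical, standing in for the real getSplits override and the wrapped TextInputFormat.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hadoop-free sketch of the indirection MyInputFormat performs.
// The "pointer" file plays the role of some_temporary_file: it lists
// the real data files, one path per line. All names are hypothetical.
public class IndirectionSketch {

    // Analogue of the overridden getSplits: before any data is read,
    // replace the job's input (the pointer file) with the files it lists.
    static List<Path> resolveInputs(Path pointerFile) throws IOException {
        List<Path> actual = new ArrayList<>();
        for (String line : Files.readAllLines(pointerFile)) {
            if (!line.trim().isEmpty()) {
                actual.add(Paths.get(line.trim()));
            }
        }
        return actual;
    }

    // Analogue of delegating to the wrapped TextInputFormat: read the
    // records of the resolved files as if they were the original input.
    static List<String> readRecords(Path pointerFile) throws IOException {
        List<String> records = new ArrayList<>();
        for (Path dataFile : resolveInputs(pointerFile)) {
            records.addAll(Files.readAllLines(dataFile));
        }
        return records;
    }
}
```

In the real implementation, resolving happens by rewriting mapred.input.dir in the Configuration and then delegating getSplits to the wrapped InputFormat.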
Note:
1. You cannot use the setLocation API from LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. A doubt may arise: what if the LOAD statement executes before the STORE? This will not happen: if a STORE and a LOAD use the same file within a script, Pig ensures that the jobs are executed in the right sequence. For more detail you may read the section "Store-load sequences" on the Pig Wiki.
Upvotes: 0
Reputation: 1177
You can't use Pig out of the box to do this.
What I would do is use some other scripting language (bash, Python, Ruby, ...) to read the file from HDFS and concatenate the paths into a single string that you can then pass as a parameter to a Pig script for use in its LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
So all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig, so your script would start with:
a = LOAD '$input'
and your Pig call would look like this:
pig -f script.pig -param 'input={hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}'
(quote the parameter so the shell does not brace-expand it).
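The read-and-concatenate step can be sketched as follows (shown in Java for concreteness, though the answer suggests any scripting language; the class and file names are illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Build the {path1,path2,...} glob that Pig's LOAD accepts from a file
// listing one HDFS path per line (in practice that listing would be
// fetched first, e.g. with `hdfs dfs -cat`). Names are illustrative.
public class GlobBuilder {

    static String toGlob(List<String> paths) {
        return "{" + String.join(",", paths) + "}";
    }

    public static void main(String[] args) throws Exception {
        List<String> paths = Files.readAllLines(Paths.get(args[0]));
        // Pass the printed result to Pig, e.g.:
        //   pig -f script.pig -param "input=<glob>"
        System.out.println(toGlob(paths));
    }
}
```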
Upvotes: 1