Ameya Khandelwal

Reputation: 21

Multiple output files for multiple input files loaded in Pig

I have 50 text files in my data directory (path: /home/admin/Desktop/data). My task is to flatten (tokenize) the data inside each text file and store the output in 50 output files.

These are the relations I wrote to do the job:

--This will load all the 50 text files.
A = Load '/home/admin/Desktop/data' Using PigStorage(','); 

--This relation will create every word as a token and will flatten the data.
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0));

STORE B into '/home/ameya/Desktop/PigOutput';

When I execute this Pig script, I get only one output file for the 50 input files.

How can I get 50 different output files, each containing the output data that corresponds to its input file?

Upvotes: 2

Views: 1325

Answers (2)

chanchal1987

Reputation: 2357

Have you tried the Pig MultiStorage UDF?

If you want to create 50 output files, one for each of the 50 input files, it is better to run your Pig script 50 times (in a loop) and pass the input file and output file as parameters to the script.
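A minimal sketch of such a loop, assuming the statements from the question are saved in a hypothetical script named tokenize.pig that loads '$input' and stores into '$output'. The script name, the two parameter names, and the PIG_CMD, INPUT_DIR and OUTPUT_DIR variables are illustrative assumptions; only the directory paths come from the question.

```shell
#!/bin/sh
# Sketch only. Assumed/hypothetical names: tokenize.pig, the "input" and
# "output" parameter names, and the PIG_CMD / INPUT_DIR / OUTPUT_DIR variables.
#
# tokenize.pig is assumed to hold the question's statements, parameterized:
#   A = LOAD '$input' USING PigStorage(',');
#   B = FOREACH A GENERATE FLATTEN(TOKENIZE($0));
#   STORE B INTO '$output';

INPUT_DIR="${INPUT_DIR:-/home/admin/Desktop/data}"
OUTPUT_DIR="${OUTPUT_DIR:-/home/ameya/Desktop/PigOutput}"
PIG_CMD="${PIG_CMD:-pig}"

# Run the Pig script once per regular file found in INPUT_DIR.
run_all() {
    for f in "$INPUT_DIR"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and unmatched globs
        name=$(basename "$f")
        # Each run stores into its own location, named after the input file.
        $PIG_CMD -param input="$f" -param output="$OUTPUT_DIR/$name" tokenize.pig
    done
}

run_all
```

Each run should leave one output location per input file under the output directory. Note that Pig's STORE creates a directory containing part files at that location, and it fails if the location already exists, so clear old results before rerunning.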

Upvotes: 1

Sai Neelakantam

Reputation: 939

The SPLIT operator partitions the contents of a relation into two or more relations based on conditional expressions. Depending on the conditions provided:

  • A tuple may be assigned to more than one relation
  • A tuple may not be assigned to any relation

Here is a directory with multiple files that Pig will load, flatten, and store:

[user1@localhost ~]# ls /pigsamples/mfilesdata/
file1  file2  file3

Loading the above directory:

grunt> input_data = LOAD '/pigsamples/mfilesdata' USING PigStorage (',') AS (f1:INT, f2:INT, f3:INT);
grunt> DUMP input_data;
(1,2,3)
(2,3,1)
(3,1,2)
(4,5,6)
(5,6,4)
(6,4,5)
(7,8,9)
(8,9,7)
(9,7,8)

Format the data based on your requirements. I used the same action as in the question.

grunt> formatted_data = FOREACH input_data GENERATE FLATTEN(TOKENIZE($0));    -- replace with your requirements

Use the SPLIT operator to split the relation into multiple relations based on the conditions.

grunt> SPLIT formatted_data
INTO split1 IF f1 <= 3,
split2 IF (f1 > 3 AND f1 <= 6),
split3 IF f1 > 6;       -- split on a column whose values are unique to each file

Output:

grunt> DUMP split1;
(1,2,3)
(2,3,1)
(3,1,2)

grunt> DUMP split2;
(4,5,6)
(5,6,4)
(6,4,5)

grunt> DUMP split3;
(7,8,9)
(8,9,7)
(9,7,8)

Upvotes: 1
