Sreenath Kamath

Reputation: 683

Output Of Pig MultiStorage To a Single File

Say I have some logs like this:

key1 something
key2 something
key3 something

I can use Pig MultiStorage to write the records to multiple folders based on the key. Is there any way to make MultiStorage write only one file per key, rather than the multiple small files it would otherwise produce?

Upvotes: 0

Views: 1719

Answers (1)

Sivasakthi Jayaraman

Reputation: 4724

MultiStorage writes one file per key for each task that emits that key, so when a single task handles the data, as in the example below, every key ends up in a single file and you do not need to do anything extra. The example has 4 keys spread across three input files. After storing them with MultiStorage, the output contains 4 directories (key1, key2, key3 and key4), each holding one file with the corresponding records. The '0' passed to MultiStorage is the index of the column to split on, here column 0.

input1:

key1,aaaa
key2,bbbb
key3,cccc
key2,bbbb
key3,eeee
key3,ffff

input2:

key1,zzzz
key2,xxxx
key4,cccc
key3,yyyy
key4,ffff

input3:

key1,iiii
key2,jjjj
key3,kkkk
key4,llll

PigScript:

REGISTER '/tmp/piggybank.jar';
A = LOAD 'input*' USING PigStorage(',') AS (f1,f2);
STORE A INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');

output$ ls

_SUCCESS    key1        key2        key3        key4

output$ cat key1/key1-0,000

key1    aaaa
key1    zzzz
key1    iiii

output$ cat key2/key2-0,000

key2    bbbb
key2    bbbb
key2    xxxx
key2    jjjj

output$ cat key3/key3-0,000

key3    cccc
key3    eeee
key3    ffff
key3    yyyy
key3    kkkk

output$ cat key4/key4-0,000

key4    cccc
key4    ffff
key4    llll
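
If the job runs with several map tasks, each task that sees a key can write its own file into that key's directory. Here is a minimal sketch of one way to avoid that, assuming the same piggybank.jar path and input layout as above. It groups by the key so that all records for a key pass through one reducer, then flattens them back into rows before storing. The aliases B and C are only illustrative.

```
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input*' USING PigStorage(',') AS (f1:chararray, f2:chararray);
-- GROUP forces a reduce phase; every record with the same f1 reaches the same reducer.
B = GROUP A BY f1;
-- FLATTEN turns each group's bag back into plain (f1, f2) rows.
C = FOREACH B GENERATE FLATTEN(A);
-- Column 0 is still the key, so MultiStorage splits on it as before.
STORE C INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
```

Several keys may share a reducer, but each key is written by only one of them, so each key directory should end up with a single file. The trade-off is an extra shuffle of the data.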

Option 2:
The other option is to use the SPLIT command to send each key's records to its own output.

PigScript:

A = LOAD 'input' USING PigStorage(',') AS (f1,f2);
SPLIT A INTO k1 IF (f1=='key1'), k2 IF (f1=='key2'), k3 IF (f1=='key3');
STORE k1 INTO 'OutputKey1' USING PigStorage();
STORE k2 INTO 'OutputKey2' USING PigStorage();
STORE k3 INTO 'OutputKey3' USING PigStorage();

The output directories OutputKey1, OutputKey2 and OutputKey3 now hold the records for the corresponding keys.
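
The SPLIT above only covers keys that are listed explicitly, so records with any other key (key4 in the earlier inputs) are not stored anywhere. Here is a hedged sketch of a catch-all branch, assuming a Pig version that supports OTHERWISE in SPLIT. The alias rest and the directory OutputOther are made-up names for illustration.

```
A = LOAD 'input' USING PigStorage(',') AS (f1:chararray, f2:chararray);
-- Records that match none of the IF conditions go to the OTHERWISE alias.
SPLIT A INTO k1 IF (f1 == 'key1'), k2 IF (f1 == 'key2'), k3 IF (f1 == 'key3'), rest OTHERWISE;
STORE k1 INTO 'OutputKey1' USING PigStorage();
STORE k2 INTO 'OutputKey2' USING PigStorage();
STORE k3 INTO 'OutputKey3' USING PigStorage();
STORE rest INTO 'OutputOther' USING PigStorage();
```

This still needs every key of interest to be known when the script is written, which is why MultiStorage is usually the better fit when the set of keys is open-ended.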

Upvotes: 1
