Reputation: 683
Say I have some logs like:
key1 something
key2 something
key3 something
I can use Pig's MultiStorage to write the records to multiple folders based on the key. But is there any way to make MultiStorage produce exactly one file per key, rather than the multiple small files it would otherwise create?
Upvotes: 0
Views: 1719
Reputation: 4724
MultiStorage writes all records with the same key under a directory named after that key, and in a case like the one below each directory ends up with a single file, so you do not need to do anything extra. In the example below there are 4 keys spread across the different input files; after storing them with MultiStorage, the output contains 4 directories (key1, key2, key3 and key4), each holding the corresponding values in a single file. The second MultiStorage argument, '0', is the index of the column to split on (column 0).
input1:
key1,aaaa
key2,bbbb
key3,cccc
key2,bbbb
key3,eeee
key3,ffff
input2:
key1,zzzz
key2,xxxx
key4,cccc
key3,yyyy
key4,ffff
input3:
key1,iiii
key2,jjjj
key3,kkkk
key4,llll
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input*' USING PigStorage(',') AS (f1,f2);
STORE A INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
output$ ls
_SUCCESS key1 key2 key3 key4
output$ cat key1/key1-0,000
key1 aaaa
key1 zzzz
key1 iiii
output$ cat key2/key2-0,000
key2 bbbb
key2 bbbb
key2 xxxx
key2 jjjj
output$ cat key3/key3-0,000
key3 cccc
key3 eeee
key3 ffff
key3 yyyy
key3 kkkk
output$ cat key4/key4-0,000
key4 cccc
key4 ffff
key4 llll
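The values in the listing above are no longer comma-separated, which is consistent with MultiStorage using tab as its default output delimiter. If the original comma-separated layout should be preserved, the four-argument form of MultiStorage can be used. This is a minimal sketch: the output path 'output_csv' is a hypothetical name, and 'none' is assumed here to mean no compression.

```pig
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input*' USING PigStorage(',') AS (f1, f2);
-- Arguments: parent path, index of the split column, compression, field delimiter.
-- 'output_csv' is a hypothetical output directory used only for this sketch.
STORE A INTO 'output_csv' USING org.apache.pig.piggybank.storage.MultiStorage('output_csv', '0', 'none', ',');
```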
Option 2:
The other option is to use the SPLIT command to send each key to its own output.
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (f1,f2);
SPLIT A INTO k1 IF (f1=='key1'), k2 IF (f1=='key2'), k3 IF (f1=='key3');
STORE k1 INTO 'OutputKey1' USING PigStorage();
STORE k2 INTO 'OutputKey2' USING PigStorage();
STORE k3 INTO 'OutputKey3' USING PigStorage();
The output directories OutputKey1, OutputKey2 and OutputKey3 will now hold the records for the corresponding keys.
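One caveat for larger inputs: as far as I know, MultiStorage names each file after the task that wrote it (the -0,000 suffix in the listing above), so a job that runs several map tasks can leave several files per key. One possible way around this, sketched here as an assumption and not tested against the data above, is to push the records through a reduce phase with GROUP so that every record for a given key is handled by the same reducer. The output path 'output_grouped' is a hypothetical name.

```pig
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input*' USING PigStorage(',') AS (f1:chararray, f2:chararray);
-- GROUP forces a reduce phase; all records sharing f1 are sent to the same reducer.
B = GROUP A BY f1;
-- Flatten the grouped bag back into the original (f1, f2) rows.
C = FOREACH B GENERATE FLATTEN(A);
-- Each key is now written by exactly one reduce task, so it should get one file.
STORE C INTO 'output_grouped' USING org.apache.pig.piggybank.storage.MultiStorage('output_grouped', '0');
```

The trade-off is an extra shuffle, and a key with a very large number of records is collected into one bag on a single reducer, so this suits cases where individual keys are of moderate size.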
Upvotes: 1