Reputation: 2661
I have data in the following format:
(Id, Description)
1, xyz is something. Abc bcd & so on.
1, xyz is something. Abc xyz & so on.
2, abc is something. Abc xyz & so on.
I need output in this format:
Id, Word
I tried this:
A = LOAD './data.txt' USING PigStorage(',') as (id: int, desc:chararray);
B = FOREACH A GENERATE id, FLATTEN(STRSPLIT(desc, '[,?:;\s]'));
This results in output such as this:
1, xyz, is, something, Abc, bcd, so, on
What I want is:
1, xyz
1, is
1, something
and so on, with one (Id, Word) row per word.
How can I do this in Pig (without writing a UDF)?
PS: I also tried:
B = FOREACH A GENERATE id, FLATTEN(datafu.pig.util.TransposeTupleToBag(STRSPLIT(desc, '[.&,?:;\s]')));
Upvotes: 0
Views: 530
Reputation: 16
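You can use the built-in TOKENIZE function in Pig. It splits a chararray into a bag of words, and FLATTEN on a bag emits one row per element (STRSPLIT returns a tuple, which FLATTEN expands into columns instead).
Here is the input file: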
cat file1
1,xyz is something
2,abc is something
A = load 'file1' using PigStorage(',');
B = foreach A generate $0, FLATTEN(TOKENIZE($1));
dump B;
(1,xyz)
(1,is)
(1,something)
(2,abc)
(2,is)
(2,something)
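The sample data in the question also contains punctuation such as '.' and '&', which the example above does not cover. A minimal sketch of one way to handle it, assuming the file name and schema from the question: replace the punctuation with spaces first, then tokenize. The aliases and the field names cleaned and word are made up for illustration, and the character class should be adjusted to the punctuation actually present in the data.

```pig
-- Sketch only: path and schema are taken from the question; alias and field names are illustrative.
A = LOAD './data.txt' USING PigStorage(',') AS (id:int, desc:chararray);

-- REPLACE takes a regular expression; turn each listed punctuation character into a space
-- so that only whitespace separates the words.
B = FOREACH A GENERATE id, REPLACE(desc, '[.&,?:;]', ' ') AS cleaned;

-- TOKENIZE returns a bag of words; FLATTEN of a bag emits one row per word.
C = FOREACH B GENERATE id, FLATTEN(TOKENIZE(cleaned)) AS word;

DUMP C;
```

With the question's input this should give rows of the form (1,xyz), (1,is), (1,something) and so on, without a UDF.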
Upvotes: 0