DilTeam

Reputation: 2661

Flattening tuples in Pig

I have data in the following format:

(Id, Description)

1, xyz is something. Abc bcd & so on.

1, xyz is something. Abc xyz & so on.

2, abc is something. Abc xyz & so on.

I need output in this format:

Id, Word

I tried this:

A = LOAD './data.txt' USING PigStorage(',') as (id: int, desc:chararray);

B = FOREACH A GENERATE id, FLATTEN(STRSPLIT(desc, '[,?:;\s]'));

This results in output such as this:

1, xyz, is, something, Abc, bcd, so, on

What I want is:

1, xyz

1, is

1, something

and so on.

How can I do this in Pig (without writing a UDF)?

PS: I also tried:

B = FOREACH A GENERATE id, FLATTEN(datafu.pig.util.TransposeTupleToBag(STRSPLIT(desc, '[.&,?:;\s]')));

Upvotes: 0

Views: 530

Answers (1)

hemasundar b

Reputation: 16

You can use TOKENIZE in Pig. It splits a string into a bag of words, and FLATTEN on a bag produces one row per word.

Here is the input file:

cat file1

1,xyz is something

2,abc is something

A = load 'file1' using PigStorage(',');

B = foreach A generate $0, FLATTEN(TOKENIZE($1));

dump B;

(1,xyz)

(1,is)

(1,something)

(2,abc)

(2,is)

(2,something)
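
To connect this back to the question: STRSPLIT returns a tuple, and FLATTEN on a tuple expands it into extra columns on the same row, whereas TOKENIZE returns a bag, and FLATTEN on a bag expands it into separate rows. The sample descriptions in the question also contain punctuation such as `.` and `&`, which TOKENIZE may leave attached to words by default. Below is a minimal sketch that strips that punctuation first; the aliases `cleaned` and `word` and the character class are illustrative assumptions, not part of the original answer.

```pig
-- Load the question's data; the path and schema are taken from the question.
A = LOAD './data.txt' USING PigStorage(',') AS (id:int, desc:chararray);

-- Replace the punctuation seen in the sample with spaces
-- (the character class is an illustrative assumption; adjust as needed).
B = FOREACH A GENERATE id, REPLACE(desc, '[.&?:;]', ' ') AS cleaned;

-- TOKENIZE yields a bag, so FLATTEN emits one (id, word) row per token.
C = FOREACH B GENERATE id, FLATTEN(TOKENIZE(cleaned)) AS word;

DUMP C;
```

Note that loading with PigStorage(',') splits on every comma, so a description that itself contains commas would be cut off at the first one.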

Upvotes: 0
