san71

Reputation: 47

Pig and Parsing issue

I am trying to figure out the best way to parse key-value pairs with Pig in a dataset that uses mixed delimiters.

My sample dataset is in the format below:

 a|b|c|k1=v1 k2=v2 k3=v3

The final output I need is:

k1,v1,k2,v2,k3,v3

I guess one way to start is:

A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;

With this I get (k1=v1 k2=v2 k3=v3) for B. Is there any way to parse this further on " " (a space) to get three fields, k1=v1, k2=v2 and k3=v3, which can then be split into k1,v1,k2,v2,k3,v3 using STRSPLIT and FLATTEN on "="?

Thanks for the help!

San

Upvotes: 0

Views: 96

Answers (1)

Avani

Reputation: 46

If you know beforehand how many key=value pairs are in each record, try this:

A = LOAD 'sample' USING PigStorage('|') AS (a1,b1,c1,d1);

B = FOREACH A GENERATE d1;

C = FOREACH B GENERATE STRSPLIT($0,'[= ]',6);  -- split on '=' or space; 6 = number of tokens (3 key=value pairs x 2)

D = FOREACH C GENERATE FLATTEN($0);

DUMP D;

output: (k1,v1,k2,v2,k3,v3)
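One thing worth checking in this approach is the pattern passed to STRSPLIT. The d1 string separates pairs with spaces and keys from values with '=', so a pattern that matches only '=' leaves tokens such as "v1 k2" joined together. The small Python sketch below mimics the split outside Pig. It is only an illustration: it assumes Python's re.split behaves like the regex split behind STRSPLIT for these simple patterns, and note that maxsplit=5 here corresponds to a limit of 6 tokens.

```python
import re

# Sample record from the question; the fourth '|' field holds the pairs.
record = "a|b|c|k1=v1 k2=v2 k3=v3"
d1 = record.split("|")[3]

# Splitting on '=' only: the space between pairs is never consumed.
only_equals = re.split(r"=", d1, maxsplit=5)

# Splitting on '=' or space: every key and value becomes its own token.
equals_or_space = re.split(r"[= ]", d1, maxsplit=5)

print(only_equals)      # ['k1', 'v1 k2', 'v2 k3', 'v3']
print(equals_or_space)  # ['k1', 'v1', 'k2', 'v2', 'k3', 'v3']
```

So if the flattened result shows values like "v1 k2", the delimiter pattern is the first thing to look at.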

If you don't know the number of key=value pairs, use ' ' as the delimiter and remove the unwanted prefix from the $0 column. (The example below declares three columns to match the sample data.)

A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);

B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);

C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);

DUMP C;

output: (k1,v1,k2,v2,k3,v3)
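For records where the number of pairs varies, the logic is: take everything after the last '|', break it on spaces, then break each piece on the first '='. Here is a rough Python sketch of that logic, just to make the steps concrete. The function name parse_pairs and its parameters are made up for this example and are not part of Pig. In Pig itself, the built-in TOKENIZE (which splits on spaces by default) may be a starting point for the first step, though I have not tested it against this data.

```python
def parse_pairs(record, field_sep="|", pair_sep=" ", kv_sep="="):
    """Return a flat [k1, v1, k2, v2, ...] list from the last field of record.

    Hypothetical helper for illustration; works for any number of pairs.
    """
    # Keep only the text after the last field separator.
    last_field = record.rsplit(field_sep, 1)[-1]
    flat = []
    for pair in last_field.split(pair_sep):
        if not pair:
            continue  # skip empty pieces caused by repeated spaces
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = pair.partition(kv_sep)
        flat.extend([key, value])
    return flat


print(",".join(parse_pairs("a|b|c|k1=v1 k2=v2 k3=v3")))  # k1,v1,k2,v2,k3,v3
print(",".join(parse_pairs("a|b|c|k1=v1 k2=v2")))        # k1,v1,k2,v2
```

Splitting each pair on the first '=' only mirrors the limit of 2 used in the STRSPLIT calls above.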

Upvotes: 1
