Himanshu Gahlot

Reputation: 235

Flatten returning only the first item in chararray

I have the following data (loaded in variable A):

(a1:a2:a3|a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3|c4:c5:c6|c7:c8:c9)

I want my final output to be as follows:

(a1:a2:a3)
(a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3)
(c4:c5:c6)
(c7:c8:c9)

Here is what I did:

B = foreach A generate flatten(STRSPLIT($0, '\\|')) as splitted:chararray;

This converted the data to:

(a1:a2:a3,a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3,c4:c5:c6,c7:c8:c9)

with the following structure:

B: {splitted: chararray}

However, when I try to flatten this chararray into separate tuples it only spits out the first item. I have tried several different ways to get the output I want but I always get the first item. Here are a couple of things I tried:

req_output = foreach B generate flatten(STRSPLIT(splitted, ','));

req_output = foreach B generate flatten(TOBAG(*));

In both cases I get the following output:

(a1:a2:a3)
(b1:b2:b3)
(c1:c2:c3)

I am not sure why this is happening. How can I get each item as a separate tuple? I do not have much experience with Pig, so any help would be appreciated.

Upvotes: 1

Views: 253

Answers (1)

Sivasakthi Jayaraman

Reputation: 4724

In relation B you are keeping only the first item: the alias "as splitted:chararray" declares a single chararray field, so the other items produced by the split do not make it into B. That is the reason for this issue. Remove the alias from the statement that builds B, i.e. change

B = foreach A generate flatten(STRSPLIT($0, '\\|')) as splitted:chararray;

to

B = foreach A generate flatten(STRSPLIT($0, '\\|'));
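One way to see the difference between the two statements is to inspect the schema Pig assigns to each relation with DESCRIBE. This is only a sketch: the file name 'input' and the relation names B1 and B2 are assumptions made for illustration, and the exact DESCRIBE text may vary between Pig versions.

```pig
-- 'input' is assumed to hold the sample lines from the question
A = LOAD 'input' USING PigStorage() AS (line:chararray);

-- With the alias: the declared schema has exactly one chararray field
B1 = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\|')) AS splitted:chararray;
DESCRIBE B1;

-- Without the alias: no single-field schema is imposed on the flattened tuple
B2 = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\|'));
DESCRIBE B2;

-- Print the rows to check that every split item is still present
DUMP B2;
```

Comparing the two DESCRIBE results before running the later steps makes it easier to spot where the extra items are lost.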

You can solve this problem in a few ways.

Input:

a1:a2:a3|a4:a5:a6
b1:b2:b3
c1:c2:c3|c4:c5:c6|c7:c8:c9

Option 1: Using TOKENIZE

A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line,'\\|'));
DUMP B;

Option 2: Using STRSPLIT + TOBAG

A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\|'));
C = FOREACH B GENERATE FLATTEN(TOBAG(*));
DUMP C;

Option 3: Using STRSPLITTOBAG (available from Pig version 0.14)

A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLITTOBAG(line,'\\|'));
DUMP B;

Output:

(a1:a2:a3)
(a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3)
(c4:c5:c6)
(c7:c8:c9)

Upvotes: 3
