Amandasaurus

Reputation: 60679

Apache Pig not parsing a tuple fully

I have a file called data that looks like this (note that each name, e.g. 'personA', is followed by tabs):

personA (1, 2, 3)
personB (2, 1, 34)

And I have an Apache pig script like this:

A = LOAD 'data' AS (name: chararray, nodes: tuple(a:int, b:int, c:int));
C = foreach A generate nodes.$0;
dump C;

The output of which makes sense:

(1)
(2)

However if I change the schema of the script to be like this:

A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;

Then the output I get is this:

(1, 2, 3)
(2, 1, 34)

It looks like the first (and only) element in this tuple is a bytearray, i.e. Pig is not parsing the input text 1, 2, 3 into a tuple.

In the future, my input will have an unknown and variable number of elements in the nodes item, so I can't just write out a:int, ….

Is there any way to get Pig to parse the input tuple as a tuple without having to write out the full schema?

Upvotes: 1

Views: 3940

Answers (3)

vaiz84

Reputation: 41

Here is another way of tackling this issue, although I know the other answers are more efficient.

data = LOAD 'data' USING PigStorage() AS (name:chararray, field2:chararray);

data = FOREACH data GENERATE name, REPLACE(REPLACE(field2, '\\(',''),'\\)','') AS field2;  

data = FOREACH data GENERATE name, STRSPLIT(field2, '\\,') AS fieldTuple;

data = FOREACH data GENERATE name, fieldTuple.$0, fieldTuple.$1, fieldTuple.$2;

  1. Load field2 as chararray
  2. Remove parentheses
  3. Split field2 by comma (it gives you a tuple with 3 fields in it)
  4. Get values by index

I know it is hacky; I just wanted to provide another way of doing this.
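To make steps 2 and 3 concrete, here is a minimal Python sketch of the same string handling, using the same regular expressions as the script above. The function names split_nodes and split_nodes_trimmed are hypothetical and exist only for illustration. It shows one thing to watch for: splitting "1, 2, 3" on a bare comma leaves a leading space on every element after the first.

```python
import re


def split_nodes(field2):
    """Mirror steps 2 and 3: strip the parentheses, then split on commas."""
    # Step 2: remove "(" and ")" (same patterns as the REPLACE calls).
    stripped = re.sub(r"\)", "", re.sub(r"\(", "", field2))
    # Step 3: split on a bare comma (same pattern as the STRSPLIT call).
    return re.split(r",", stripped)


def split_nodes_trimmed(field2):
    """Variant that also swallows whitespace around each comma."""
    stripped = re.sub(r"\)", "", re.sub(r"\(", "", field2)).strip()
    return re.split(r"\s*,\s*", stripped)


print(split_nodes("(1, 2, 3)"))          # elements keep their leading spaces
print(split_nodes_trimmed("(1, 2, 3)"))  # elements are clean
```

Since STRSPLIT takes a regular expression, a pattern such as '\\s*,\\s*' should give the trimmed behaviour in the Pig script as well, though I have not tested that variant.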

Upvotes: 0

Senthil Nathan

Reputation: 1

This is no longer a limitation. Pig parses tuples in the input file, treating the comma as the field separator. I tested this in Apache Pig version 0.15.0.

A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;

Output I get is:

(1)
(2)

Upvotes: 0

Donald Miner

Reputation: 39893

Pig does not accept what you are passing in as valid. The default loader, PigStorage, only accepts delimited files (tab-delimited by default). It is not smart enough to parse the tuple construct, with its parentheses and commas, that you have in the text. Your options are:

  • Reformat your file to be tab delimited: personA 1 2 3
  • Read the file in line by line with TextLoader, then write some sort of UDF that parses the line and returns the data in the form you want.
  • Write your own custom loader.
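As a rough idea of what the parsing in the second option involves, here is a minimal Python sketch. The name parse_line is hypothetical, and the sketch assumes every element is an integer, as in the sample data. It only covers the parsing logic; wiring it into Pig as a UDF (Pig can register Python functions through Jython) is not shown.

```python
def parse_line(line):
    """Split a line such as 'personA<TAB>(1, 2, 3)' into a name and a tuple of ints."""
    # Everything before the first tab is the name; the rest is the tuple text.
    name, _, raw = line.rstrip("\n").partition("\t")
    # Drop extra tabs/spaces and the enclosing parentheses.
    inner = raw.strip().lstrip("(").rstrip(")")
    # Accept any number of elements, including none; int() ignores stray spaces.
    nodes = tuple(int(part) for part in inner.split(",") if part.strip())
    return name, nodes


for sample in ["personA\t(1, 2, 3)", "personB\t\t(2, 1, 34)"]:
    print(parse_line(sample))
```

Because the function does not fix the number of elements, it handles the variable-length case from the question; the catch is that a tuple of unknown length has no fixed schema on the Pig side.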

Upvotes: 4
