MonkeyBonkey

Reputation: 47861

How to flatten and denormalize in Pig

I'd like to create a flattened join table from the following schema:

    titles = FOREACH programs GENERATE (px.pig.udf.PARSE_KEYWORDS(program_xml))
        AS program:
            (root_id: long,
            ids: bag {(idtype: chararray, idvalue: chararray)},
            keywords: bag {(keytype: chararray, keyvalue: chararray)});

If the input is:

(1, {('x','foo'),('y','bar')},{})
(2, {('x','fiz'),('y','buzz')},{})
(3, {('x','moo')},{})
...

The output should be something like:

root_id    idvalue
1          foo
1          bar
2          fiz
2          buzz
3          moo

How would I do that in Pig?

Upvotes: 0

Views: 316

Answers (1)

Pracheer Agarwal

Reputation: 131

  1. Project the first two columns.

    x = foreach titles generate root_id, ids;

  2. Flatten the second column.

    y = foreach x generate root_id, FLATTEN(ids) as (idtype:chararray, idvalue:chararray);

This will give you the result in the following format:

    root_id  idtype  idvalue
    1        x       foo
    1        y       bar
    2        x       fiz
    2        y       buzz
    3        x       moo

  3. Project the first and third columns to get the required result.
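
Put together, the steps could look roughly like the sketch below. It rests on one assumption that is worth checking with `DESCRIBE titles;`: because the question's `AS program:(...)` clause names a single tuple field, `titles` may expose `root_id` and `ids` only as `program.root_id` and `program.ids`. If your relation already has them as top-level columns, drop the `program.` prefix. The alias names `pairs` and `result` are made up for illustration.

```pig
-- Sketch only: assumes `titles` holds one tuple field named `program`,
-- as declared in the question's AS clause.

-- Steps 1 and 2: pull root_id out of the tuple and un-nest the ids bag,
-- producing one row per (root_id, idtype, idvalue).
pairs = FOREACH titles GENERATE
            program.root_id AS root_id,
            FLATTEN(program.ids) AS (idtype:chararray, idvalue:chararray);

-- Step 3: keep only the columns the question asks for.
result = FOREACH pairs GENERATE root_id, idvalue;

DUMP result;
```

Note that FLATTEN on an empty bag produces no rows for that record, so a `root_id` with no ids would not appear in `result`.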

Upvotes: 2
