chrislondon
chrislondon

Reputation: 12031

Expand delimited columns to dynamic values in Apache Pig

I have some data in an interesting format. It's like this:

ID    Name    Info
1     Joe     quality=82,activity=23,age:rank:foo,21:0/1:30
2     Bob     activity=32,age:foo,22:31,quality=43
3     Mary    foo:age:rank,24:23:1/1,quality=62,activity=14

what I'm trying to extract are the age:rank:foo data which is represented as:

age:rank:foo,(age value):(rank value):(foo value)

For every record they may or may not have some or all of those keys/values and they may be in any order. I'm trying to convert the data so I get the following:

ID    Name    Quality  Age    Rank    Foo
1     Joe     82       21     0/1     30
2     Bob     43       22     NULL    31
3     Mary    62       23     1/1     24

For getting the quality I'm doing something like this:

A = LOAD '...' ...;
B = FOREACH A GENERATE ID, Name, REGEX_EXTRACT(info, 'quality=([0-9]+),', 1) AS Quality;

But I have no idea how to convert the strange nonuniform key/value map into columns. Any suggestions? (And no I can't get them to send me the data in a better format :P )

Upvotes: 0

Views: 103

Answers (1)

user2303197
user2303197

Reputation: 1316

You can use REGEX_EXTRACT/REGEX_EXTRACT_ALL for this, but it's cumbersome. Since there can be one to three age/rank/foo fields in any order, you need 6 regexes for instances with all 3 fields to cover all permutations, another 6 for instances with 2 fields, and 3 more for a single field.

You get 11 fields each for age/rank/foo from this (6 from the 3-field extractions, 4 from the 2-field extractions, 1 from the single field extraction). For each of age/rank/foo, pick the one that is non-null as your result (e.g. (foo1 is not null ? foo1 : (foo2 is not null ? foo2 : (foo3 is not null ? foo3 : ... ))) as foo_final).

Upvotes: 1

Related Questions