Reputation: 12031
I have some data in an interesting format. It's like this:
ID Name Info
1 Joe quality=82,activity=23,age:rank:foo,21:0/1:30
2 Bob activity=32,age:foo,22:31,quality=43
3 Mary foo:age:rank,24:23:1/1,quality=62,activity=14
what I'm trying to extract are the age:rank:foo data which is represented as:
age:rank:foo,(age value):(rank value):(foo value)
For every record they may or may not have some or all of those keys/values and they may be in any order. I'm trying to convert the data so I get the following:
ID Name Quality Age Rank Foo
1 Joe 82 21 0/1 30
2 Bob 43 22 NULL 31
3 Mary 62 23 1/1 24
For getting the quality I'm doing something like this:
A = LOAD '...' ...;
B = FOREACH A GENERATE ID, Name, REGEX_EXTRACT(info, 'quality=([0-9]+),', 1) AS Quality;
But I have no idea how to convert the strange nonuniform key/value map into columns. Any suggestions? (And no I can't get them to send me the data in a better format :P )
Upvotes: 0
Views: 103
Reputation: 1316
You can use REGEX_EXTRACT
/REGEX_EXTRACT_ALL
for this, but it's cumbersome. Since there can be one to three age/rank/foo fields in any order, you need 6 regexes for instances with all 3 fields to cover all permutations, another 6 for instances with 2 fields, and 3 more for a single field.
You get 11 fields each for age/rank/foo from this (6 from the 3-field extractions, 4 from the 2-field extractions, 1 from the single field extraction). For each of age/rank/foo, pick the one that is non-null as your result (e.g. (foo1 is not null ? foo1 : (foo2 is not null ? foo2 : (foo3 is not null ? foo3 : ... ))) as foo_final
).
Upvotes: 1