Reputation: 33223
I have three data sets:
1) Base data 2) data_dict_1 3) data_dict_2
The base data is well-formatted JSON. For example:
{"id1":"foo", "id2":"bar" ,type:"type1"}
{"id1":"foo", "id2":"bar" ,type:"type2"}
data_dict_1
1 foo
2 bar
3 foobar
....
data_dict_2
-1 foo
-2 bar
-3 foobar
... and so on
Now, what I want is this: if a record is of type1, look up id1 in data_dict_1 and id2 in data_dict_2, and replace each value with the corresponding integer id. If a record is of type2, look up id1 in data_dict_2 and id2 in data_dict_1, and assign the corresponding ids. For example:
{"id1":1, "id2":2 ,type:"type1"}
{"id1":-1, "id2":-2 ,type:"type2"}
And so on. How do I do this in Pig?
Upvotes: 0
Views: 641
Reputation: 1
Note: the example above is not valid JSON, because the type key is not quoted.
Assuming Pig 0.10 or later, you can use the built-in JsonLoader, which takes a schema, to load the data:
data = LOAD 'loljson' USING JsonLoader('id1:chararray,id2:chararray,type:chararray');
Then load the dicts:
dict_1 = LOAD 'data_dict_1' USING PigStorage(' ') AS (id:int, key:chararray);
dict_2 = LOAD 'data_dict_2' USING PigStorage(' ') AS (id:int, key:chararray);
Then split the data based on the type value:
SPLIT data INTO type1 IF type == 'type1', type2 IF type == 'type2';
JOIN them appropriately:
type1_joined = JOIN type1 BY id1, dict_1 BY key;
type1_joined = FOREACH type1_joined GENERATE type1::id1 AS id1, type1::id2 AS id2, type1::type AS type, dict_1::id AS id;
type2_joined = JOIN type2 BY id2, dict_2 BY key;
type2_joined = FOREACH type2_joined GENERATE type2::id1 AS id1, type2::id2 AS id2, type2::type AS type, dict_2::id AS id;
Since the schemas are equal, UNION them together:
final_data = UNION type1_joined, type2_joined;
This produces:
DUMP final_data;
(foo,bar,type2,-2)
(foo,bar,type1,1)
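The joins above resolve only one id per record and append it as a fourth field. If both id1 and id2 should be replaced, as the question describes, one possible extension is to chain a second JOIN per type. This is a minimal sketch, assuming the type1, type2, dict_1 and dict_2 relations defined above; the aliases t1_first, t1_step, t1_both, t1_ids (and their t2 counterparts), all_ids, and the output path 'out_json' are made up for illustration.

```pig
-- type1: id1 is looked up in dict_1, id2 in dict_2
t1_first = JOIN type1 BY id1, dict_1 BY key;
t1_step  = FOREACH t1_first GENERATE dict_1::id AS id1, type1::id2 AS id2, type1::type AS type;
t1_both  = JOIN t1_step BY id2, dict_2 BY key;
t1_ids   = FOREACH t1_both GENERATE t1_step::id1 AS id1, dict_2::id AS id2, t1_step::type AS type;

-- type2: the dictionaries are swapped (id1 from dict_2, id2 from dict_1)
t2_first = JOIN type2 BY id1, dict_2 BY key;
t2_step  = FOREACH t2_first GENERATE dict_2::id AS id1, type2::id2 AS id2, type2::type AS type;
t2_both  = JOIN t2_step BY id2, dict_1 BY key;
t2_ids   = FOREACH t2_both GENERATE t2_step::id1 AS id1, dict_1::id AS id2, t2_step::type AS type;

-- both relations now have the schema (id1:int, id2:int, type:chararray)
all_ids = UNION t1_ids, t2_ids;

-- write the result back out as JSON ('out_json' is a hypothetical path)
STORE all_ids INTO 'out_json' USING JsonStorage();
```

These are inner joins, so a record whose id1 or id2 has no entry in the relevant dictionary is dropped; a LEFT OUTER join would keep such records with a null id instead. Following the rule as worded in the question, the sample type1 record would map to id1 = 1 and id2 = -2, which differs from the example output given in the question, so it is worth checking which dictionary each field is really meant to use.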
Upvotes: 1