frazman

Reputation: 33223

Reading JSON files in Pig

I have three data sets:

1) Base data 2) data_dict_1 3) data_dict_2

The base data is well-formatted JSON. For example:

{"id1":"foo", "id2":"bar" ,type:"type1"}
{"id1":"foo", "id2":"bar" ,type:"type2"}

data_dict_1

1 foo
2 bar
3 foobar
....

data_dict_2

-1 foo
-2 bar
-3 foobar
... and so on

What I want is this: if a record is of type1, look up id1 in data_dict_1 and id2 in data_dict_2, and replace each with the corresponding integer id. If the record is of type2, look up id1 in data_dict_2 and id2 in data_dict_1 instead. For example:

{"id1":1, "id2":2 ,type:"type1"}
{"id1":-1, "id2":-2 ,type:"type2"}

And so on. How do I do this in Pig?

Upvotes: 0

Views: 641

Answers (1)

TC1

Reputation: 1

Note: the example in the question is not valid JSON, because the type key is not quoted.

Assuming Pig 0.10 or later, you can use the built-in JsonLoader, pass it a schema, and load the base data with

data = LOAD 'loljson' USING JsonLoader('id1:chararray,id2:chararray,type:chararray');

and load the dictionaries with

dict_1 = LOAD 'data_dict_1' USING PigStorage(' ') AS (id:int, key:chararray);
dict_2 = LOAD 'data_dict_2' USING PigStorage(' ') AS (id:int, key:chararray);

Then split that based on the type value

SPLIT data INTO type1 IF type == 'type1', type2 IF type == 'type2';

JOIN each part with the appropriate dictionary (here type1 on id1 against dict_1, and type2 on id2 against dict_2)

type1_joined = JOIN type1 BY id1, dict_1 BY key;
type1_joined = FOREACH type1_joined GENERATE type1::id1 AS id1, type1::id2 AS id2, type1::type AS type, dict_1::id AS id;

type2_joined = JOIN type2 BY id2, dict_2 BY key;
type2_joined = FOREACH type2_joined GENERATE type2::id1 AS id1, type2::id2 AS id2, type2::type AS type, dict_2::id AS id;

and since the schemas are equal, UNION them together

final_data = UNION type1_joined, type2_joined;

this produces

DUMP final_data;

(foo,bar,type2,-2)
(foo,bar,type1,1)
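
The joins above look up only one id per record, while the question asks for both id1 and id2 to be replaced. A minimal sketch of how that could be done with two chained joins per type follows. It reuses the relations type1, type2, dict_1 and dict_2 from above and follows the rule as stated in the question's prose (type1: id1 from data_dict_1, id2 from data_dict_2; type2 the other way round). The intermediate alias names and the output path 'mapped_json' are made up for illustration, and the script has not been run against the asker's data.

```pig
-- type1: id1 is looked up in dict_1, id2 in dict_2
t1_first     = JOIN type1 BY id1, dict_1 BY key;
t1_half      = FOREACH t1_first GENERATE dict_1::id AS id1, type1::id2 AS id2_key, type1::type AS type;
t1_second    = JOIN t1_half BY id2_key, dict_2 BY key;
type1_mapped = FOREACH t1_second GENERATE t1_half::id1 AS id1, dict_2::id AS id2, t1_half::type AS type;

-- type2: id1 is looked up in dict_2, id2 in dict_1
t2_first     = JOIN type2 BY id1, dict_2 BY key;
t2_half      = FOREACH t2_first GENERATE dict_2::id AS id1, type2::id2 AS id2_key, type2::type AS type;
t2_second    = JOIN t2_half BY id2_key, dict_1 BY key;
type2_mapped = FOREACH t2_second GENERATE t2_half::id1 AS id1, dict_1::id AS id2, t2_half::type AS type;

-- both relations now have the schema (id1:int, id2:int, type:chararray)
mapped = UNION type1_mapped, type2_mapped;

-- write the result back out as JSON ('mapped_json' is a hypothetical path)
STORE mapped INTO 'mapped_json' USING JsonStorage();
```

These are inner joins, so a record whose id1 or id2 has no entry in the relevant dictionary is dropped. If such records should be kept with a null id, a LEFT OUTER join (JOIN t1_half BY id2_key LEFT OUTER, dict_2 BY key) is one option.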

Upvotes: 1
