Reputation: 81
I have an original input json snippet ('/home/user/testsample.json') -
{"key": "somehashvalue","columns": [["Event:2014-03-26 00\\:29\\:13+0200:json","{\"user\":{\"credType\":\"ADDRESS\",\"credValue\":\"01:AA:A4:G1:HH:UU\",\"cAgent\":null,\"cType\":\"ACE\"},\"timestamp\":1395786553,\"sessionId\":1395785353,\"className\":\"Event\",\"subtype\":\"CURRENTLYACTIVE\",\"vType\":\"TEST\",\"vId\":1235080,\"eType\":\"CURRENTLYACTIVE\",\"eData\":\"1\"}",1395786553381001],["Event:2014-03-26 00\\:29\\:13+0200:","",1395786553381001]]}
I tried to use a JSON SerDe to parse the above JSON into my Hive columns. However, 1395786553381001 is not present in a format the SerDe can map to a Hive column, i.e. this value appears without a key (Hive only understands JSON values that come after a key and a colon).
So instead I took the Array type approach and created a table -
CREATE TABLE mytesttable (
key string,
columns array < array< string > >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA LOCAL INPATH '/home/user/testsample.json'
OVERWRITE INTO TABLE mytesttable;
select columns[0][1] from mytesttable; gives -
{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"}
The above appears clean, but then I also need columns[*][2] (the trailing numeric values) in a Hive column for further transformations.
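For reference, here is roughly what I can extract so far (assuming the mytesttable definition above; get_json_object is a built-in Hive UDF, and the column aliases below are just illustrative):

```sql
-- columns[0][2] is the unkeyed numeric, readable by position;
-- the nested JSON string in columns[0][1] can be probed field by field
SELECT columns[0][2]                                     AS raw_ts,
       get_json_object(columns[0][1], '$.timestamp')     AS json_ts,
       get_json_object(columns[0][1], '$.user.credType') AS cred_type
FROM mytesttable;
```

This works per-field, but it still does not give me a single table format that maps everything at once.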
I wrote a regex hive query to cleanse the original Json present in '/home/user/testsample.json'
(assume it is present in a table tablewithinputjson)
SELECT
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(ij.columna, '["][{]', '{'),'[}]["]', '}'), '\\\\', '') AS columna
FROM tablewithinputjson ij;
The above query returns -
{"key": "somehashvalue","columns": [["Event:2014-03-26 00:29:13+0200:json",{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"},1395786553381001],["Event:2014-03-26 00:29:13+0200:","",1395786553381001]]}
But here again, 1395786553381001 cannot be mapped to a Hive column, since it appears after a comma rather than a colon; in other words, the value is still present without a key. (I could prepend "test": to 1395786553381001, but I do not want to customize the input data, since a) I am not comfortable with that much customization, b) it does not seem like a good solution, and c) it would needlessly waste my Hadoop cluster's space and time.)
To sum up: I have not been able to come up with a Hive table format that fully parses and maps all the fields in the original JSON snippet. Any suggestions are welcome. Please let me know if anything above is unclear.
Upvotes: 4
Views: 9749
Reputation: 2415
Posting an end-to-end solution. Step-by-step procedure to convert JSON to a Hive table:
step 1) install maven if not there already
>sudo apt-get install maven
step 2) install git if not there already, then clone the SerDe repository
>sudo git clone https://github.com/rcongiu/Hive-JSON-Serde.git
step 3) go into the $HOME/Hive-JSON-Serde folder
step 4) build the serde package
>sudo mvn -Pcdh5 clean package
step 5) The serde file will be in $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar
step 6) Add serde as dependency jar in hive
hive> ADD JAR $HOME/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;
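Optional sanity check (assuming the Hive CLI session above): LIST JARS shows the jars registered in the current session, so you can confirm the SerDe was added before creating any tables:

```sql
hive> LIST JARS;
```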
step 7) create json file in $HOME/books.json (Example)
{"value": [{"id": "1","bookname": "A","properties": {"subscription": "1year","unit": "3"}},{"id": "2","bookname":"B","properties":{"subscription": "2years","unit": "5"}}]}
step 8) create tmp1 table in hive
hive>CREATE TABLE tmp1 (
value ARRAY<struct<id:string,bookname:string,properties:struct<subscription:string,unit:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.value' = 'value'
)
STORED AS TEXTFILE;
step 9) load the data from json to tmp1 table
>LOAD DATA LOCAL INPATH '$HOME/books.json' INTO TABLE tmp1;
step 10) create a tmp2 table to do the explode operation on tmp1. This intermediate step breaks the multi-level JSON structure into multiple rows. Note: if your JSON structure is simple and single-level, skip this step
hive>create table tmp2 as
SELECT *
FROM tmp1
LATERAL VIEW explode(value) itemTable AS items;
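Optional sanity check before building the final table (assuming the tmp2 table above): after the LATERAL VIEW explode, each element of the value array should appear as its own row in the items column:

```sql
-- should return one row per book rather than a single array row
SELECT items.id, items.bookname FROM tmp2;
```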
step 11) create the hive table and load the values from the tmp2 table (select from the exploded items column, not value[0], so every array element is picked up rather than just the first)
hive>create table books as
select items.id as id, items.bookname as name, items.properties.subscription as subscription, items.properties.unit as unit from tmp2;
step 12) drop tmp tables
hive>drop table tmp1;
hive>drop table tmp2;
step 13) test the hive table
hive>select * from books;
output:
id name subscription unit
1 A 1year 3
2 B 2years 5
Upvotes: 3