Skiddles

Reputation: 147

Hive SQL to extract a record as JSON

I have a table in Hive with multiple fields. For example

-------------------------------------------------------------
|Primary Key|Attribute 1|Attribute 2|Attribute 3|Attribute 4|
-------------------------------------------------------------

I need to query the table using Hive SQL and return a JSON object embedded in CSV output. For example:

Primary Key, Attribute 1, {"Primary Key":"", "Attribute 2":"", "Attribute 3":""}, Attribute 4

I don't have any post-query programming language available to convert the values to JSON format. I have no problem with straight SQL; I just don't know whether I can produce the JSON object using SQL alone.

Any ideas greatly appreciated.

Upvotes: 2

Views: 4081

Answers (1)

Kumar Rohit

Reputation: 507

I used the brickhouse JAR (brickhouse-0.6.0.jar) to produce JSON output like this. It is the best JAR I have found for the job, but if you need the keys in the JSON to come out in camelCase, an extra element has to be put in the to_json(named_struct()) clause.

You can read more about this JAR here. Below is a code snippet showing how I did it.

ADD JAR path/to/jar/brickhouse-0.6.0.jar;
CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF'; 

CREATE TABLE IF NOT EXISTS V2
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
AS
SELECT
    employeeId,
    to_json(named_struct('camel_case', true, 'employee_id', employeeId, 'manager_id', msgId, 'org_id', orgId), true) AS jsonString
FROM 
    employee
WHERE 
    employeeId=101;

The output would be like this:

101, {"camelCase" : true, "employeeId" : 101, "managerId" : 201, "orgId" : 301}

I also had to handle the case of an array of JSON elements, which needed to be wrapped in square brackets ([]). If you are sure you have a single JSON object per record, you can ignore the outermost layer of the query below (i.e. drop the SELECT ... FROM y and use only the queries up to SELECT ... FROM x). Here is the query changed to handle an array of JSON:

CREATE TABLE IF NOT EXISTS V2
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
AS 
SELECT 
    y.employeeId, CONCAT('[', concat_ws(',', y.jsonData), ']') AS jsonData
FROM 
(
    SELECT 
        x.employeeId, collect_list(jsonString) AS jsonData 
    FROM 
    (
        SELECT
            employeeId,
            to_json(named_struct('camel_case', true, 'employee_id', employeeId, 'manager_id', msgId, 'org_id', orgId), true) AS jsonString
        FROM 
            employee
        WHERE 
            employeeId=101
    ) x
    GROUP BY 
        x.employeeId
) y;

The output would be like this:

101, [{"camelCase" : true, "employeeId" : 101, "managerId" : 201, "orgId" : 301}]

You can also tweak the query to write the data directly to HDFS using this variant:

INSERT OVERWRITE DIRECTORY '/path/of/target/directory/'
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
SELECT ...
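For example, plugging in the single-record query from above (a sketch; the directory path is a placeholder and the employee table and its columns are assumed as before):

INSERT OVERWRITE DIRECTORY '/path/of/target/directory/'
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
SELECT
    employeeId,
    to_json(named_struct('camel_case', true, 'employee_id', employeeId, 'manager_id', msgId, 'org_id', orgId), true) AS jsonString
FROM 
    employee
WHERE 
    employeeId=101;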

Upvotes: 1
