Reputation: 3052
I'm trying to create a schema for hive to parse json, however, I am having trouble creating the schema when the json doc is in the following structure:
{
"context": {
"custom": {
"dimensions": [{
"action": "GetFilters"
},
{
"userId": "12345678"
}]
}
}
}
I am using the Hadoop emulator for Azure's HDInsights on windows (8.1) and am using java (1.8.0_73). I compiled the SerDe successfully with Maven. I would think that the following would work:
add jar ../lib/json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar;
DROP TABLE events;
CREATE EXTERNAL TABLE events (
context STRUCT<custom:STRUCT<dimensions:array<STRUCT<action:string>,STRUCT<userId:string>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/event';
When I take out the nested ARRAY>, then the schema parses ok, but with it in, I get the following exception:
MismatchedTokenException(282!=9) at org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecog nizer.java:617) at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115) at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HivePa rser.java:34909) at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(Hi veParser.java:33113) at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java :36331) at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:35334 ) at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:35 054) at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HivePa rser.java:34914) at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(Hi veParser.java:33085) at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java :36331) at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:35334 ) at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:35 054) at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser. java:34754) at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HivePar ser.java:32951) at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveP arser.java:4544) at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.ja va:2144) at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.j ava:1398) at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java: 1036) at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:19 9) at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:16 6) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:409) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:2 68) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793 ) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl. java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces sorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) FAILED: ParseException line 2:69 missing > at ',' near 'STRUCT' in column specif ication line 2:76 mismatched input '<' expecting : near 'STRUCT' in column specification
hive>
Upvotes: 0
Views: 1264
Reputation: 510
I used an array of maps to process the App Insights custom dimensions with rcongiu's JSON SerDe. This is using an HDInsight cluster with a linked blob storage account.
add jar wasb://<mycontainer>@<mystorage>.blob.core.windows.net/json-serde-1.3.7-jar-with-dependencies.jar;
DROP TABLE Logs;
CREATE EXTERNAL TABLE Logs (
event array<struct<
count:int,
name:string>
>,
context struct<
custom:struct<
dimensions:array<map<string, string>>
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'wasb://<mycontainer>@<mystorage>.blob.core.windows.net/events/';
SELECT
event[0].name as EventName,
context.custom.dimensions[0]['action'] as Action,
context.custom.dimensions[1]['key'] as Key
FROM Logs
WHERE event[0].name = 'Click';
Upvotes: 0
Reputation: 3052
I ended up getting it to work by removing the nested STRUCTs in the ARRAY STRUCT, but I have to access the values with [#]. For example, the following builds the schema:
DROP TABLE events;
CREATE EXTERNAL TABLE events (
context STRUCT<custom:STRUCT<dimensions:ARRAY<STRUCT<action:string,userId:string>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/event';
Then I can access items such as the userId like so:
SELECT context.custom.dimensions.userId[1] FROM events;
It works, but is not as readable as I would prefer.
Upvotes: 1
Reputation: 11625
That external table looked good to me. Maybe try downloading another distribution of that JSON serde. I have had success with: http://www.congiu.net/hive-json-serde/
I have had success in HDInsight 3.2 with http://www.congiu.net/hive-json-serde/1.3/cdh5/ but you might try a newer build for HDP.
Documentation here: https://github.com/rcongiu/Hive-JSON-Serde
Upvotes: 0