Reputation: 3
im trying to create a AWS Athena table using RegexSerDe.. due to some export issues i cannot use JsonSerDe.
2019-04-11T09:05:16.775Z {"timestamp":"data0","level":"data1","thread":data2","logger":"data3","message":"data4","context":"data5"}
I was trying to obtain json values with a regex, but without any luck.
CREATE EXTERNAL TABLE IF NOT EXISTS dsfsdfs.mecs3(
`timestamp` string,
`level` string,
`thread` string,
`logger` string,
`message` string,
`context` string
)
)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "[ :]+(\\"[^\"]*\\")"
)LOCATION 's3://thisisates/'
Error: HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
Would be great some help as i'm not an expert in regex.
Thanks and BR.
Upvotes: 0
Views: 956
Reputation: 132932
Getting this working will probably be very hard - even if you can write a regex that will capture the columns out of the JSON structure, can you guarantee that all JSON documents will be rendered with the properties in the same order? JSON itself considers {"a": 1, "b": 2}
and {"b": 2, "a": 1}
to be equivalent, so many JSON libraries don't guarantee, or even care about ordering.
Another approach to this is to create a table with two columns: timestamp
and data
, as a regex table with a regex with two capture groups, the timestamp and the rest of the line – or possibly as a CSV table if the character after the timestamp is a tab (if it's a space it won't work since the JSON will contain spaces):
CREATE EXTERNAL TABLE IF NOT EXISTS mecs3_raw (
`timestamp` string,
`data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\S+) (.+)$"
)
LOCATION 's3://thisisates/'
(the regex assumes that there is a space between the timestamp and the JSON structure, change it as needed).
That table will not be very usable by itself, but what you can do next is to create a view that extracts the properties from the JSON structure:
CREATE VIEW mecs3 AS
SELECT
"timestamp",
JSON_EXTRACT_SCALAR("data", '$.level') AS level,
JSON_EXTRACT_SCALAR("data", '$.thread') AS thread,
JSON_EXTRACT_SCALAR("data", '$.logger') AS logger,
JSON_EXTRACT_SCALAR("data", '$.message') AS message,
JSON_EXTRACT_SCALAR("data", '$.context') AS context
FROM mecs3_raw
(mecs3_raw
is the table with timestamp
and data
columns)
This will give you what you want and will be much less error prone.
Upvotes: 1