Reputation: 603
Preface:
I defined an Athena table in AWS, using S3 as the source (I defined it manually, without a Glue crawler). The files contain data from EventBridge, and Kinesis Firehose separates the records with newlines (so they are in the newline-delimited JSON format Athena requires).
The actual data (in the detail column) is very deeply nested JSON with many varying keys (which is why I didn't use a Glue crawler). To work around this, I defined all the columns as string and left it to the user (when writing a query) to know the keys of interest.
It all seems to work, and I'm able to query the table and find the data I expect to find, except for one problem.
The Problem:
Athena seems to make sure that each and every key and value in the detail JSON is wrapped in double quotes. Note that it doesn't simply add quotes, but rather makes sure they're there. So JSON strings don't get additional quotes, but all other types do (false becomes "false", -0.12 becomes "-0.12", but "some string" remains "some string").
Any idea why this is happening?
Additional info:
Even weirder, when I check the result file that Athena dumps in S3, I see that it has added another pair of double quotes around each and every key and value. So my result CSV file looks something like this (column 1 is a JSON string in the SELECT, column 2 is a number):
query_select_col_1,query_select_col_2
"{""date_key"":""2022-04-05T09:48:23Z"",""num_key"":""-0.12""}",12
My table DDL is as follows:
CREATE EXTERNAL TABLE `table_name`(
  `version` string COMMENT 'from deserializer',
  `id` string COMMENT 'from deserializer',
  `detail-type` string COMMENT 'from deserializer',
  `source` string COMMENT 'from deserializer',
  `account` string COMMENT 'from deserializer',
  `time` string COMMENT 'from deserializer',
  `region` string COMMENT 'from deserializer',
  `resources` string COMMENT 'from deserializer',
  `detail` string COMMENT 'from deserializer')
PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket-name/pref1/pref2/table_name/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'transient_lastDdlTime'='1649176009')
Input data in S3 looks something like:
{"date_key":"2022-04-05T09:48:23Z","num_key":-0.12,"bool_key":false}
Upvotes: 0
Views: 3963
Reputation: 132862
The format of Athena result sets is CSV with quoted fields. CSV is a loosely defined format with lots of variations.
In JSON the type of a value is defined by syntax: strings are quoted, numbers and booleans are not, and so on. CSV has no type information. The quotes are not there to describe types, but to ensure that fields can contain the field separator (e.g. to make it clear that "a,b","c" is two fields and not three). The fields of a CSV are neither strings, numbers, nor booleans; they have no type at all.
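As a minimal sketch (Python's csv module here, purely for illustration), parsing that example yields two fields, with the separator kept inside the first:

import csv
import io

# Quoted: the comma stays inside the first field, so this is two fields.
print(next(csv.reader(io.StringIO('"a,b","c"'))))  # ['a,b', 'c']

# Unquoted: the same characters parse as three fields.
print(next(csv.reader(io.StringIO('a,b,c'))))      # ['a', 'b', 'c']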
The repeated double quotes are another quirk of (some variants of) CSV. The quotes allow fields to contain the field separator, but what if a field needs to contain the quote character? Double it! For example, """a""","b" is two fields, one containing "a" and the other containing b. Note, though, that the quotes in the JSON literal "a" aren't part of the string, they are syntax; JSON uses a backslash to escape the quote character, so "\"a\"" is the string "a".
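Again as an illustrative Python sketch, the csv module reproduces this doubling when writing (QUOTE_ALL mimics Athena's quote-everything style) and undoes it when reading:

import csv
import io

# Write two fields, the first containing the quote character itself.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(['"a"', 'b'])
print(buf.getvalue().strip())  # """a""","b"

# Reading it back recovers the original fields, embedded quotes intact.
print(next(csv.reader(io.StringIO(buf.getvalue()))))  # ['"a"', 'b']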
There are other variants of CSV: some use an escape character like \ before the quote character, some quote only the fields that contain the field separator, and so on. But this variant, with all fields quoted and the quote character escaped by doubling, is the one Athena uses; the sketch below applies it to the result row from your question.
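Here is a Python sketch (illustration only, using the exact result row from the question) showing that a reader which understands this variant strips the outer quotes and un-doubles the inner ones, leaving ordinary JSON in the first field:

import csv
import io
import json

# The result row from the question: a JSON string column and a number column.
row = '"{""date_key"":""2022-04-05T09:48:23Z"",""num_key"":""-0.12""}",12'

col1, col2 = next(csv.reader(io.StringIO(row)))
print(col1)              # {"date_key":"2022-04-05T09:48:23Z","num_key":"-0.12"}
print(json.loads(col1))  # {'date_key': '2022-04-05T09:48:23Z', 'num_key': '-0.12'}
print(col2)              # 12 -- just a string; CSV carries no type information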
It has nothing to do with the source data: whether it's JSON, Avro, Parquet, or CSV, the result format is always the same.
Upvotes: 2