Reputation: 603
Preface:
I defined an Athena table in AWS, using S3 as the source (I defined it manually, without a Glue crawler). The files contain data from EventBridge, and Kinesis Firehose separates the records with newlines (so they are in the newline-delimited JSON format Athena requires).
The actual data (in the detail column) is very deeply nested JSON with many varying keys (which is why I didn't use a Glue crawler). To work around this, I defined all the columns as string and left it to the user (when writing a query) to know the keys of interest.
It all seems to work, and I'm able to query the table and find the data I expect to find, except for one problem.
The Problem:
Athena seems to make sure that each and every key and value in the detail JSON is wrapped in double quotes. Note that it doesn't simply add quotes, but rather makes sure they're there. So JSON strings don't get additional quotes, but all other types do (false becomes "false", -0.12 becomes "-0.12", but "some string" remains "some string").
Any idea why this is happening?
Additional info:
Even weirder, when I check the result file that Athena dumps in S3, I see that it has added another pair of double quotes around each and every key and value. So my result CSV file looks something like this (column 1 is a JSON string in the SELECT, column 2 is a number):
query_select_col_1,query_select_col_2
"{""date_key"":""2022-04-05T09:48:23Z"",""num_key"":""-0.12""}",12
My table DDL is as follows:
CREATE EXTERNAL TABLE `table_name`(
  `version` string COMMENT 'from deserializer',
  `id` string COMMENT 'from deserializer',
  `detail-type` string COMMENT 'from deserializer',
  `source` string COMMENT 'from deserializer',
  `account` string COMMENT 'from deserializer',
  `time` string COMMENT 'from deserializer',
  `region` string COMMENT 'from deserializer',
  `resources` string COMMENT 'from deserializer',
  `detail` string COMMENT 'from deserializer')
PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket-name/pref1/pref2/table_name/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'transient_lastDdlTime'='1649176009')
Input data in S3 looks something like:
{"date_key":"2022-04-05T09:48:23Z","num_key":-0.12,"bool_key":false}
Upvotes: 0
Views: 3963
Reputation: 132862
The format of Athena result sets is CSV with quoted fields. CSV is a loosely defined format with lots of variations.
In JSON the type of a value is defined by syntax: strings are quoted, numbers and booleans are not, and so on. CSV has no type information. The quotes are not there to describe types, but to ensure that fields can contain the field separator (e.g. to make it clear that "a,b","c" is two fields and not three). The fields of a CSV are neither strings, numbers, nor booleans; they have no type at all.
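As a minimal sketch (Python's csv module here, purely for illustration), parsing that example yields two fields, with the separator kept inside the first:

import csv
import io

# Quoted: the comma stays inside the first field, so this is two fields.
print(next(csv.reader(io.StringIO('"a,b","c"'))))  # ['a,b', 'c']

# Unquoted: the same characters parse as three fields.
print(next(csv.reader(io.StringIO('a,b,c'))))      # ['a', 'b', 'c']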
The repeated double quotes are another quirk of (some variants of) CSV. The quotes allow fields to contain the field separator, but what if a field needs to contain the quote character? Double it! For example, """a""","b" is two fields, one containing "a" and the other containing b. Note, though, that the quotes in the JSON literal "a" aren't part of the string, they are syntax; JSON uses a backslash to escape the quote character, so "\"a\"" is the string "a".
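Again as an illustrative Python sketch, the csv module reproduces this doubling when writing (QUOTE_ALL mimics Athena's quote-everything style) and undoes it when reading:

import csv
import io

# Write two fields, the first containing the quote character itself.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(['"a"', 'b'])
print(buf.getvalue().strip())  # """a""","b"

# Reading it back recovers the original fields, embedded quotes intact.
print(next(csv.reader(io.StringIO(buf.getvalue()))))  # ['"a"', 'b']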
There are other variants of CSV: some use an escape character like \ before the quote character, some quote only the fields that contain the field separator, and so on. But this variant, with all fields quoted and the quote character escaped by doubling, is the one Athena uses; the sketch below applies it to the result row from your question.
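Here is a Python sketch (illustration only, using the exact result row from the question) showing that a reader which understands this variant strips the outer quotes and un-doubles the inner ones, leaving ordinary JSON in the first field:

import csv
import io
import json

# The result row from the question: a JSON string column and a number column.
row = '"{""date_key"":""2022-04-05T09:48:23Z"",""num_key"":""-0.12""}",12'

col1, col2 = next(csv.reader(io.StringIO(row)))
print(col1)              # {"date_key":"2022-04-05T09:48:23Z","num_key":"-0.12"}
print(json.loads(col1))  # {'date_key': '2022-04-05T09:48:23Z', 'num_key': '-0.12'}
print(col2)              # 12 -- just a string; CSV carries no type information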
It has nothing to do with the source data: whether it's JSON, Avro, Parquet, or CSV, the result format is always the same.
Upvotes: 2