carlos
carlos

Reputation: 145

ignore malformated json from AWS Athena query

I am trying to get the data of Bogota from OpenAQ in AWS Athena, using the following query.

SELECT *
FROM openaq
WHERE city='Bogota'

I get following referring to a malformed JSON.

Row is not a valid JSON Object - JSONException: Unterminated string at 201 [character 202 line 1]

Is there a way to ignore the rows with malformed JSONs in the query?

Upvotes: 0

Views: 870

Answers (1)

Zander Venter
Zander Venter

Reputation: 21

I have tried adjusting the table template from this source using this line of code as per this thread.

'ignore.malformed.json'='true' 

So the new table template query reads:

CREATE EXTERNAL TABLE `openaq`(
  `date` struct<utc:string,local:string> COMMENT 'from deserializer', 
  `parameter` string COMMENT 'from deserializer', 
  `location` string COMMENT 'from deserializer', 
  `value` float COMMENT 'from deserializer', 
  `unit` string COMMENT 'from deserializer', 
  `city` string COMMENT 'from deserializer', 
  `attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer', 
  `averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer', 
  `coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer', 
  `country` string COMMENT 'from deserializer', 
  `sourcename` string COMMENT 'from deserializer', 
  `sourcetype` string COMMENT 'from deserializer', 
  `mobile` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'ignore.malformed.json'='true') 
STORED AS INPUTFORMAT 
'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://openaq-fetches/realtime-gzipped'
TBLPROPERTIES (
  'transient_lastDdlTime'='1518373755')

This seems to solve the error for me.

Upvotes: 2

Related Questions