Simon Rex

Reputation: 183

Athena queries returning quotes in values

I have a CSV file with data that looks like "John Doe",Washington,100,22,.... Only the values that contain whitespace are surrounded by quotes; values without whitespace are unquoted. The data has been processed by an AWS Glue Crawler, and when I query it with AWS Athena, the values are returned with the quotes included. I don't want the quotes in my query results. I've tried the approach described at https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-csv, but that method only works if every value in the CSV is quoted. Is there any way to fix this? The crawler generated the following DDL:

CREATE EXTERNAL TABLE `weatherdata_output`(
  `name` string, 
  `state` string, 
  `lat` double, 
  `lng` double,
  ...)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<bucket>/output'
TBLPROPERTIES (
  'transient_lastDdlTime'='1630559293')
 

Upvotes: 2

Views: 11374

Answers (2)

schiavuzzi

Reputation: 1072

I gave up on dealing with the peculiarities of AWS Glue Crawler's CSV deserializer(s), and with other schema-less file formats as well.

Instead, I prefer to convert CSV datasets (and other formats) into Parquet first. It's very simple to do, for example with this short Python script:

import pandas as pd

# Read all columns as strings; cast explicitly later if needed
df = pd.read_csv('dataset.csv', sep=',', dtype=str)

df.info()           # Print the detected schema for verification and further tuning
print(df.head(30))  # Print a sample of dataframe records, for verification
print(f"records count = {len(df)}")  # len(df) is the row count; df.count() returns per-column counts

# Writing Parquet requires pyarrow or fastparquet to be installed
df.to_parquet('dataset.parquet')

This way I have full control of my schema, and I can even enforce a custom schema if I'm not happy with the one Pandas autodetects. As a bonus, Parquet is faster to query than CSV. Avro and ORC are good alternatives too; each is optimized for a particular use case (for example, columnar vs. row-oriented data access). Most importantly, all of these formats support embedding the schema inside the file itself, unlike CSV and similar dumb plain file formats.
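For example, to enforce a custom schema rather than accept what Pandas infers, you can cast the columns explicitly before writing. A minimal sketch, assuming the column names from the question's DDL (adjust names and types to your data):

import pandas as pd

# Read everything as strings first, then cast the numeric columns explicitly
df = pd.read_csv('dataset.csv', sep=',', dtype=str)
df = df.astype({'lat': 'float64', 'lng': 'float64'})

# The Parquet file now carries this exact schema
df.to_parquet('dataset.parquet', index=False)

The resulting Parquet file embeds the enforced types, so a Glue crawler (or Athena directly) picks them up without any guessing.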

CSV files (and other schema-less file formats) should really burn in hell and never be used for data exchange purposes :-)

Make your life simpler and just switch to a smarter format that supports embedded schemas, such as Parquet, ORC, or Avro. Glue Crawlers work perfectly with all of them, and you'll save yourself lots of time and quite a few headaches.
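And if you'd rather skip the crawler entirely, pointing Athena straight at the Parquet output is also straightforward. A sketch, assuming the file produced above is uploaded to S3 and that lat/lng were cast to floats as shown earlier (table name and path are illustrative):

CREATE EXTERNAL TABLE `weatherdata_parquet`(
  `name` string,
  `state` string,
  `lat` double,
  `lng` double)
STORED AS PARQUET
LOCATION
  's3://<bucket>/parquet-output/'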

Upvotes: 0

John Rotenstein

Reputation: 269151

To test your situation, I uploaded this data file:

"John Doe",Washington,100,22
"Peter Smith",Sydney,200,88
"Mary Moss",Tokyo,300,44

I then ran the Glue crawler, and it gave a DDL similar to yours. When I queried the data, it also had the problem of including quotation marks in the first column.

I then ran this command instead:

CREATE EXTERNAL TABLE `city2`(
  `col0` string, 
  `col1` string, 
  `col2` bigint, 
  `col3` bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION
  's3://bucket/city/'

Then, when I queried the table, it came out correctly:

[Screenshot: Athena query results, with no quotation marks in the values]

Therefore, I would recommend using ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' as the table format.
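Applied to the table in the question, the DDL would look something like this (a sketch using the columns shown in the question; adjust the location to your bucket):

CREATE EXTERNAL TABLE `weatherdata_output`(
  `name` string, 
  `state` string, 
  `lat` double, 
  `lng` double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION
  's3://<bucket>/output'

OpenCSVSerde strips the surrounding double quotes while parsing, which is why the quotation marks no longer appear in the query results.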

Upvotes: 3
