Reputation: 201
I have daily CSV files in AWS S3, partitioned by month and year. The CSV files have all the data enclosed in double quotes, e.g.:
"Time","ItemCode","Volume","Available"
"1/07/2019 12:00:00 AM","12148A","71","True"
"1/07/2019 12:00:00 AM","12149Z","23","False"
When I create a standard crawler over this data, the schema has interpreted the following data types:
Time: string
ItemCode: string
Volume: bigint
Available: boolean
Input format: org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Serde parameters: field.delim ,
However, when I query this data in Athena via "select * ...", I can only see data in the Time and ItemCode columns (and the partition columns). All the other columns are empty, so I get this result:
Time                     ItemCode   Volume   Available   Month   Year
"1/07/2019 12:00:00 AM"  "12148A"                        07      2019
"1/07/2019 12:00:00 AM"  "12149Z"                        07      2019
I'm not sure what's going on. It seems that only the string data shows within Athena.
I also want 'Time' to have type 'timestamp'. I'm not sure how to correct all this. I have to rerun the crawler every month, so there is no point in manually editing the schema, as a rerun will reset the schema.
Do I have to create a Glue job to strip all the double quotes from the csv files and save them to new csv files, and crawl over these new files?
Thanks for any advice on how to fix this problem, or how to handle this the best way.
Upvotes: 2
Views: 6411
Reputation: 6831
The culprit is most likely LazySimpleSerDe, which Glue configures for CSV files. It does not support quoted fields. OpenCSVSerDe does support quoted fields, so configuring it for your table should fix the issue.
You can manually overwrite the SerDe library in the table configuration to use org.apache.hadoop.hive.serde2.OpenCSVSerde, and you'll probably want to configure the crawler to not revert that manual change (as described here).
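For reference, a table defined directly with that SerDe might look like the following DDL sketch (the table name, bucket, and prefix are placeholders, and note that OpenCSVSerDe reads every column as STRING, so typed columns have to be cast at query time):

```sql
CREATE EXTERNAL TABLE my_items (
  `Time` string,
  ItemCode string,
  Volume string,      -- OpenCSVSerDe returns strings; cast to bigint in queries
  Available string    -- likewise, cast to boolean in queries
)
PARTITIONED BY (month string, year string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://your-bucket/your-prefix/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```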
Upvotes: 0
Reputation: 132862
I don't know why Glue does this, but Glue Crawlers are unfortunately very often very bad at doing the right thing. The issue is that LazySimpleSerDe, the serde (data serializer/deserializer) that Glue picked, does not support quoted fields. You might get it to work by manually changing the table to use OpenCSVSerDe instead – but I guess that Glue would just flip it back when it runs the crawler the next time.
My suggestion is to stop using Glue Crawlers. There's a feature that was launched very recently called Partition Projection that I think could solve your problem in a much more convenient way. If you post the S3 URI of one of your partitions I can show you how to set it up.
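For illustration only – since the S3 URI isn't posted here, this sketch assumes a Hive-style layout like s3://your-bucket/data/year=2019/month=07/ – partition projection is configured entirely through table properties, so no crawler or MSCK REPAIR is needed to register new partitions:

```sql
-- Fragment of a CREATE EXTERNAL TABLE statement; the column list,
-- bucket, and year range below are assumptions for illustration.
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2019,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'storage.location.template' = 's3://your-bucket/data/year=${year}/month=${month}/'
)
```

With this in place, Athena computes the partition locations from the properties at query time instead of looking them up in the Glue catalog.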
Your second issue is that you want the Time column to be a TIMESTAMP. This is only possible if the data is in a supported date/time format, which unfortunately yours is not. See Supported Data Types for more information.
You will have to parse the strings using, for example, date_parse in queries. To avoid having to do that in every query, you can create a view that parses the dates and have your queries use the view instead.
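A rough sketch of such a view (table and view names are placeholders; the format string assumes day-first dates like "1/07/2019", so adjust it if your dates are actually month-first):

```sql
CREATE OR REPLACE VIEW items_parsed AS
SELECT
  -- MySQL-style format specifiers: %d day, %m month, %Y year,
  -- %h 12-hour clock, %i minutes, %s seconds, %p AM/PM
  date_parse("time", '%d/%m/%Y %h:%i:%s %p') AS event_time,
  itemcode,
  CAST(volume AS bigint) AS volume,
  CAST(available AS boolean) AS available,  -- assumes 'True'/'False' strings
  month,
  year
FROM my_items;
```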
Upvotes: 4