Reputation: 201
I have daily CSV files in AWS S3, partitioned by month and year. The CSV files have all the data enclosed in double quotes, e.g.:
"Time","ItemCode","Volume","Available"
"1/07/2019 12:00:00 AM","12148A","71","True"
"1/07/2019 12:00:00 AM","12149Z","23","False"
When I create a standard crawler over this data, the schema has interpreted the following data types:
Time: string
ItemCode: string
Volume: bigint
Available: boolean
Input format: org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Serde parameters: field.delim ,
However, when I query this data in Athena via "select * ...", I can only see data in the Time and ItemCode columns (and the partition columns). All the other columns are empty, so I get this result:
Time                     ItemCode   Volume   Available   Month   Year
"1/07/2019 12:00:00 AM"  "12148A"                        07      2019
"1/07/2019 12:00:00 AM"  "12149Z"                        07      2019
I'm not sure what's going on. It seems that only the string data shows within Athena.
I also want 'Time' to have type 'timestamp'. I'm not sure how to correct all this. I have to rerun the crawler every month, so there is no point in manually editing the schema, as a rerun will reset the schema.
Do I have to create a Glue job to strip all the double quotes from the csv files and save them to new csv files, and crawl over these new files?
Thanks for any advice on how to fix this problem, or how to handle this the best way.
Upvotes: 2
Views: 6411
Reputation: 6831
The culprit is most likely LazySimpleSerDe, which Glue configures for CSV files. It does not support quoted fields. OpenCSVSerDe does support quoted fields, so configuring it for your table should fix the issue.
You can manually overwrite the SerDe library in the table configuration to use org.apache.hadoop.hive.serde2.OpenCSVSerde, and you'll probably want to configure the crawler to not revert that manual change (as described here).
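For reference, a table defined directly with that SerDe might look like the following DDL sketch (the table name, bucket, and prefix are placeholders, and note that OpenCSVSerDe reads every column as STRING, so typed columns have to be cast at query time):

```sql
CREATE EXTERNAL TABLE my_items (
  `Time` string,
  ItemCode string,
  Volume string,      -- OpenCSVSerDe returns strings; cast to bigint in queries
  Available string    -- likewise, cast to boolean in queries
)
PARTITIONED BY (month string, year string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://your-bucket/your-prefix/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```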
Upvotes: 0
Reputation: 132862
I don't know why Glue does this, but Glue Crawlers are unfortunately very often very bad at doing the right thing. The issue is that LazySimpleSerDe, the serde (data serializer/deserializer) that Glue picked, does not support quoted fields. You might get it to work by manually changing the table to use OpenCSVSerDe instead – but I guess that Glue would just flip it back when it runs the crawler the next time.
My suggestion is to stop using Glue Crawlers. There's a feature that was launched very recently called Partition Projection that I think could solve your problem in a much more convenient way. If you post the S3 URI of one of your partitions I can show you how to set it up.
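For illustration only – since the S3 URI isn't posted here, this sketch assumes a Hive-style layout like s3://your-bucket/data/year=2019/month=07/ – partition projection is configured entirely through table properties, so no crawler or MSCK REPAIR is needed to register new partitions:

```sql
-- Fragment of a CREATE EXTERNAL TABLE statement; the column list,
-- bucket, and year range below are assumptions for illustration.
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2019,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'storage.location.template' = 's3://your-bucket/data/year=${year}/month=${month}/'
)
```

With this in place, Athena computes the partition locations from the properties at query time instead of looking them up in the Glue catalog.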
Your second issue is that you want the Time column to be a TIMESTAMP. This is only possible if the data is in a supported date/time format, which unfortunately yours is not. See Supported Data Types for more information.
You will have to parse the strings using, for example, date_parse in queries. To avoid having to do that in every query, you can create a view that parses the dates and have your queries use the view instead.
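A rough sketch of such a view (table and view names are placeholders; the format string assumes day-first dates like "1/07/2019", so adjust it if your dates are actually month-first):

```sql
CREATE OR REPLACE VIEW items_parsed AS
SELECT
  -- MySQL-style format specifiers: %d day, %m month, %Y year,
  -- %h 12-hour clock, %i minutes, %s seconds, %p AM/PM
  date_parse("time", '%d/%m/%Y %h:%i:%s %p') AS event_time,
  itemcode,
  CAST(volume AS bigint) AS volume,
  CAST(available AS boolean) AS available,  -- assumes 'True'/'False' strings
  month,
  year
FROM my_items;
```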
Upvotes: 4