Reputation: 43
I'm trying to use the Glue crawler to retrieve an array of strings from my CSV files. Once the crawler crawls my data and generates the data catalog table, I then use Athena to query my data. The problem is Athena (or perhaps it is due to Glue) isn't able to properly extract my array of strings properly.
Here's the format for my CSV files, specifically for the array of strings field. It is enclosed in double quotes:
"['ESLCG', 'PTLEI', 'PTLIS', 'ESBIO', 'FRLRH', 'FRLEH', 'BEZEE', 'NLIJM']"
I manually create the Glue crawler through the Glue console. The data catalog table definition unfortunately recognizes the field as a string rather than an array of strings. What's worse is when I perform an Athena query, Athena returns the above field value as:
"['ESLCG'
The remaining string value then bleeds over to the next field, so the next field's value is:
'PTLEI'
with the remaining fields getting the remaining strings from the array of strings. Consequently, it mangles the handling of the remaining fields.
I'd like to use the built in functionality of Glue/Athena rather than doing anything custom. I've also modified the following serde properties using different values, noting the Glue crawler determines the use of the org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe serializer library:
field.delim - I've tried single and double quotes
collection.delim - ,
colelction.delim - ,
quote.delim - I've tried single and double quotes
Unfortunately, modifying and adding serde properties don't work. Moreover, Glue doesn't allow for custom serde...
What am I doing wrong?
Upvotes: 1
Views: 1000
Reputation: 2064
I was able to get this working by using a pipe separator and comma separated strings in the field defined as array. With this configuration you can dodge quote escaping which seems to break things.
field1|field2a,field2b|field3
Here were my org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe table parameters:
The crawler worked fine and this was queryable in Athena. Note that I did try this with other array types like array and array and those did not work properly.
Upvotes: 0