Reputation: 764
I have a csv file in s3 with following structure
"name1"|"tmc International"|"123, link2"
am using below CF template to read this file into Athena
T1Table:
Type: AWS::Glue::Table
Properties:
DatabaseName: 'db_1'
TableInput:
Name: 'table_1'
Description: "table_1"
TableType: EXTERNAL_TABLE
Parameters: { "classification" : "csv", "compressionType" : "none", "typeOfData": "file" }
StorageDescriptor:
Location: !Sub 's3://temp_loc/temp_csv/'
InputFormat: 'org.apache.hadoop.mapred.TextInputFormat'
OutputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
SerdeInfo:
Parameters:
serialization.format: '|'
field.delim: '|'
skip.header.line.count: '1'
quoteChar: "\""
#input.regex: '\"([^|]*)\"'
SerializationLibrary: 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
Columns:
- {"Name": "name", "Type": "string"}
- {"Name": "desc1", "Type": "string"}
- {"Name": "desc2", "Type": "string"}
When trying to query this data in Athena the columns are showing the quotes(") in the columns , the fields are separated correctly though. Is there any other format which can render the output without the quotes.
Create table DDL
CREATE EXTERNAL TABLE `table_1`(
`name` string COMMENT 'from deserializer',
`desc1` string COMMENT 'from deserializer',
`desc2` string COMMENT 'from deserializer',
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'='|',
'skip.header.line.count'='1')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://temp_loc/temp_csv/'
TBLPROPERTIES (
'classification'='csv',
'compressionType'='none',
'typeOfData'='file')
Upvotes: 0
Views: 4227
Reputation: 41
If your table has partitions, please verify the partition properties for SerDe info. It may still be showing serialization lib 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' . If so, you will have to use partition API and change it.
Upvotes: 2
Reputation: 132862
The default CSV serde (serializer/deserializer) in Athena does not support quoted fields. You need to use the alternative serde: OpenCSVSerDe.
This serde has different names for the configuration parameters. Instead of field.delim
you use separatorChar
. The quote character is configured by quoteChar
, which you already have in your template (it's just not working with the serde you are using).
Your configuration should look something like this:
…
SerdeInfo:
Parameters:
separatorChar: '|'
quoteChar: '"'
'skip.header.line.count': '1'
SerializationLibrary: 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
…
Upvotes: 1
Reputation: 5124
I have used below CF template to create table by looking at your data assuming there is no header in the source file and was able to query it fine.
AWSTemplateFormatVersion: 2010-09-09
Description: Glue Table
Resources:
T1Table:
Type: AWS::Glue::Table
Properties:
DatabaseName: 'db_1'
CatalogId: !Ref AWS::AccountId
TableInput:
Name: 'table_1'
Description: "table_1"
TableType: EXTERNAL_TABLE
StorageDescriptor:
Location: !Sub 's3://s3path/'
InputFormat: 'org.apache.hadoop.mapred.TextInputFormat'
OutputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
SerdeInfo:
Parameters:
quoteChar: "\""
separatorChar: "|"
SerializationLibrary: 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
Columns:
- {"Name": "name", "Type": "string"}
- {"Name": "desc1", "Type": "string"}
- {"Name": "desc2", "Type": "string"}
This created a table with below definition :
CREATE EXTERNAL TABLE `table_1`(
`name` string COMMENT 'from deserializer',
`desc1` string COMMENT 'from deserializer',
`desc2` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://s3path/'
Then I ran a select query which gave me expected output as shown below:
Upvotes: 0