Infinite
Infinite

Reputation: 764

Athena table displaying " in column values

I have a csv file in s3 with following structure

"name1"|"tmc International"|"123, link2"

am using below CF template to read this file into Athena

T1Table:
       Type: AWS::Glue::Table
       Properties:
         DatabaseName: 'db_1'
         TableInput:
           Name: 'table_1'
           Description: "table_1"
           TableType: EXTERNAL_TABLE
           Parameters: { "classification" : "csv", "compressionType" : "none", "typeOfData": "file" }
           StorageDescriptor:
             Location: !Sub 's3://temp_loc/temp_csv/'
             InputFormat: 'org.apache.hadoop.mapred.TextInputFormat'
             OutputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
             SerdeInfo:
               Parameters:
                 serialization.format: '|'
                 field.delim: '|'
                 skip.header.line.count: '1'
                 quoteChar: "\""
                 #input.regex: '\"([^|]*)\"'
               SerializationLibrary: 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
             Columns:
             - {"Name": "name", "Type": "string"}
             - {"Name": "desc1", "Type": "string"}
             - {"Name": "desc2", "Type": "string"}

When trying to query this data in Athena the columns are showing the quotes(") in the columns , the fields are separated correctly though. Is there any other format which can render the output without the quotes.

Create table DDL

CREATE EXTERNAL TABLE `table_1`(
  `name` string COMMENT 'from deserializer', 
  `desc1` string COMMENT 'from deserializer', 
  `desc2` string COMMENT 'from deserializer', 
  )
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'quoteChar'='\"', 
  'separatorChar'='|', 
  'skip.header.line.count'='1') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://temp_loc/temp_csv/'
TBLPROPERTIES (
  'classification'='csv', 
  'compressionType'='none', 
  'typeOfData'='file')

Upvotes: 0

Views: 4227

Answers (3)

Hari Thatavarthy
Hari Thatavarthy

Reputation: 41

If your table has partitions, please verify the partition properties for SerDe info. It may still be showing serialization lib 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' . If so, you will have to use partition API and change it.

Upvotes: 2

Theo
Theo

Reputation: 132862

The default CSV serde (serializer/deserializer) in Athena does not support quoted fields. You need to use the alternative serde: OpenCSVSerDe.

This serde has different names for the configuration parameters. Instead of field.delim you use separatorChar. The quote character is configured by quoteChar, which you already have in your template (it's just not working with the serde you are using).

Your configuration should look something like this:

…
SerdeInfo:
  Parameters:
    separatorChar: '|'
    quoteChar: '"'
    'skip.header.line.count': '1'
  SerializationLibrary: 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
…

Upvotes: 1

Prabhakar Reddy
Prabhakar Reddy

Reputation: 5124

I have used below CF template to create table by looking at your data assuming there is no header in the source file and was able to query it fine.

AWSTemplateFormatVersion: 2010-09-09
Description: Glue Table
Resources:
  T1Table:
       Type: AWS::Glue::Table
       Properties:
         DatabaseName: 'db_1'
         CatalogId: !Ref AWS::AccountId

         TableInput:
           Name: 'table_1'
           Description: "table_1"
           TableType: EXTERNAL_TABLE
           
           StorageDescriptor:
             Location: !Sub 's3://s3path/'
             InputFormat: 'org.apache.hadoop.mapred.TextInputFormat'
             OutputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
             SerdeInfo:
               Parameters:
                 quoteChar: "\""
                 separatorChar: "|"
               SerializationLibrary: 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
             Columns:
             - {"Name": "name", "Type": "string"}
             - {"Name": "desc1", "Type": "string"}
             - {"Name": "desc2", "Type": "string"}

This created a table with below definition :

CREATE EXTERNAL TABLE `table_1`(
  `name` string COMMENT 'from deserializer', 
  `desc1` string COMMENT 'from deserializer', 
  `desc2` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'quoteChar'='\"', 
  'separatorChar'='|') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://s3path/'

Then I ran a select query which gave me expected output as shown below:

enter image description here

Upvotes: 0

Related Questions