Roy
Roy

Reputation: 119

Need to skip CSV header when reading from s3

when i'm trying to load csv file from s3, headers are injecting into columns. i tried to skip header by

   TBLPROPERTIES (
    "skip.header.line.count"="1")

But still no use. Any advice please?

CREATE EXTERNAL TABLE skipheader(
  permalink string, 
  company string, 
  numemps bigint, 
  category string, 
  city string, 
  state string, 
  fundeddate string, 
  raisedamt bigint, 
  raisedcurrency string, 
  round string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucketname/filename/'
  TBLPROPERTIES (
    "skip.header.line.count"="1")

Upvotes: 2

Views: 3708

Answers (2)

noahpc
noahpc

Reputation: 386

If you can use the OpenCSV SerDe and make it work for you like described by Theo, go for it. However, if you have other tables of other formats, you can get around it in the following way even though it is a bit of a hack. You can simply add a WHERE clause that excludes the headers like SELECT * FROM skipheader WHERE permalink != 'permalink'. Recently, Athena added the ability to make a new table as result of query (https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html) so if you could even filter out the headers and save to a new location using Athena if that was better for you.

Upvotes: 0

Theo
Theo

Reputation: 132862

Looking at the release notes for when the feature was released it says

Support for ignoring headers. You can use the skip.header.line.count property when defining tables, to allow Athena to ignore headers. This is currently supported for queries that use the OpenCSV SerDe, and not for Grok or Regex SerDes.

My interpretation of this is that it won't work with LazySimpleSerde, which is what you get when you say ROW FORMAT DELIMITED, and that you have to use the OpenCSV serde:

CREATE EXTERNAL TABLE skipheader ( … )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS TEXTFILE
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/filename/'
TBLPROPERTIES ("skip.header.line.count"="1")

The OpenCSV serde works differently from LazySimpleSerde, it has much more limited data type support, but on the other hand it is more configurable.

Upvotes: 3

Related Questions