Reputation: 119
when i'm trying to load csv file from s3, headers are injecting into columns. i tried to skip header by
TBLPROPERTIES (
"skip.header.line.count"="1")
But still no use. Any advice please?
CREATE EXTERNAL TABLE skipheader(
permalink string,
company string,
numemps bigint,
category string,
city string,
state string,
fundeddate string,
raisedamt bigint,
raisedcurrency string,
round string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/filename/'
TBLPROPERTIES (
"skip.header.line.count"="1")
Upvotes: 2
Views: 3708
Reputation: 386
If you can use the OpenCSV SerDe and make it work for you like described by Theo, go for it. However, if you have other tables of other formats, you can get around it in the following way even though it is a bit of a hack. You can simply add a WHERE clause that excludes the headers like
SELECT * FROM skipheader WHERE permalink != 'permalink'
. Recently, Athena added the ability to make a new table as result of query (https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html) so if you could even filter out the headers and save to a new location using Athena if that was better for you.
Upvotes: 0
Reputation: 132862
Looking at the release notes for when the feature was released it says
Support for ignoring headers. You can use the skip.header.line.count property when defining tables, to allow Athena to ignore headers. This is currently supported for queries that use the OpenCSV SerDe, and not for Grok or Regex SerDes.
My interpretation of this is that it won't work with LazySimpleSerde, which is what you get when you say ROW FORMAT DELIMITED
, and that you have to use the OpenCSV serde:
CREATE EXTERNAL TABLE skipheader ( … )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS TEXTFILE
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/filename/'
TBLPROPERTIES ("skip.header.line.count"="1")
The OpenCSV serde works differently from LazySimpleSerde, it has much more limited data type support, but on the other hand it is more configurable.
Upvotes: 3