Kirk Broadhurst
Kirk Broadhurst

Reputation: 28728

Athena unable to parse date using OpenCSVSerde

I have a very simple csv file on S3

"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"

I'm trying to create a table across this using the following command

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

When I query the table (select * from test) I'm getting an error like this:

HIVE_BAD_DATA:
Error parsing field value '2018-01-01' for field 1: For input string: "2018-01-01"

Some more info:

The documentation definitely implies that this is supported. Looking for anyone who has encountered this, or any suggestions.

Upvotes: 14

Views: 7520

Answers (3)

makson
makson

Reputation: 2273

For columns specified with the timestamp data type in your CREATE TABLE statement, recognizes TIMESTAMP data if it is specified in the UNIX numeric format in milliseconds, such as 1579059880000. Open CSV SerDe for processing CSV

Read how to work with timestamp data

Upvotes: 0

Alexandre
Alexandre

Reputation: 561

In fact, it is a problem with the documentation that you mentioned. You were probably referring to this excerpt:

[OpenCSVSerDe] recognizes the DATE type if it is specified in the UNIX format, such as YYYY-MM-DD, as the type LONG.

Understandably, you were formatting your date as YYYY-MM-DD. However, the documentation is deeply misleading in that sentence. When it refers to UNIX format, it actually has UNIX Epoch Time in mind.

Based on the definition of UNIX Epoch, your dates should be integers (hence the reference to the type LONG in the documentation). Your dates should be the number of days that have elapsed since January 1, 1970.

For instance, your sample CSV should look like this:

"i","d","f","s"
"1","17532","1.001","something great!"
"2","17533","2.002","something terrible!"
"3","17534","3.003","I'm an oil man"

Then you can run that exact same command:

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

If you query your Athena table with select * from test, you will get:

  i       d          f              s           
 --- ------------ ------- --------------------- 
  1   2018-01-01   1.001   something great!     
  2   2018-01-02   2.002   something terrible!  
  3   2018-01-03   3.003   I'm an oil man    

An analogous problem also compromises the explanation on TIMESTAMP in the aforementioned documentation:

[OpenCSVSerDe] recognizes the TIMESTAMP type if it is specified in the UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...], as the type LONG.

It seems to indicate that we should format TIMESTAMPs as yyyy-mm-dd hh:mm:ss[.f...]. Not really. In fact, we need to use UNIX Epoch Time again, but this time with the number of milliseconds that have elapsed since Midnight 1 January 1970.

For instance, consider the following sample CSV:

"i","d","f","s","t"
"1","17532","1.001","something great!","1564286638027"
"2","17533","2.002","something terrible!","1564486638027"
"3","17534","3.003","I'm an oil man","1563486638012"

And the following CREATE TABLE statement:

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string, t timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

This will be the result set for select * from test:

  i       d          f              s                       t             
 --- ------------ ------- --------------------- ------------------------- 
  1   2018-01-01   1.001   something great!      2019-07-28 04:03:58.027  
  2   2018-01-02   2.002   something terrible!   2019-07-30 11:37:18.027  
  3   2018-01-03   3.003   I'm an oil man        2019-07-18 21:50:38.012  

Upvotes: 18

Tanveer Uddin
Tanveer Uddin

Reputation: 1525

One way around is declare the d column as string and then in the select query use DATE(d) or date_parse to parse the value as date data type.

Upvotes: 3

Related Questions