Reputation: 953
I have been using the below query to create a table within Athena,
CREATE EXTERNAL TABLE IF NOT EXISTS test.test_table (
`converteddate` string,
`userid` string,
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3:XXXX'
TBLPROPERTIES ('has_encrypted_data'='false',"skip.header.line.count"="1")
This returns me:
converteddate | userid
-------------------------------------
2017-11-29T05:00:00 | 00001
2017-11-27T04:00:00 | 00002
2017-11-26T03:00:00 | 00003
2017-11-25T02:00:00 | 00004
2017-11-24T01:00:00 | 00005
I would like to return:
converteddate | userid
-------------------------------------
2017-11-29 05:00:00 | 00001
2017-11-27 04:00:00 | 00002
2017-11-26 03:00:00 | 00003
2017-11-25 02:00:00 | 00004
2017-11-24 01:00:00 | 00005
and have converteddate as a datetime and not a string.
Upvotes: 3
Views: 2810
Reputation: 561
(My answer has one premise: you are using OpenCSVSerDe. It doesn't apply to LazySimpleSerDe, for instance.)
If you have the option of changing the format of your input CSV file, you should convert your timestamp to UNIX Epoch Time. That's the format that OpenCSVSerDe is expecting.
For instance, your sample CSV looks like this:
"converteddate","userid"
"2017-11-29T05:00:00","00001"
"2017-11-27T04:00:00","00002"
"2017-11-26T03:00:00","00003"
"2017-11-25T02:00:00","00004"
"2017-11-24T01:00:00","00005"
It should be:
"converteddate","userid"
"1511931600000","00001"
"1511755200000","00002"
"1511665200000","00003"
"1511575200000","00004"
"1511485200000","00005"
Those integers are the number of milliseconds since Midnight January 1, 1970 for each one of your original dates.
Then you can run a slightly modified version of your CREATE TABLE
statement:
CREATE EXTERNAL TABLE IF NOT EXISTS test.test_table (
converteddate timestamp,
userid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3:XXXX'
TBLPROPERTIES ("skip.header.line.count"="1");
If you query your Athena table with select * from test_table
, this will be the result:
converteddate userid
------------------------- --------
2017-11-29 05:00:00.000 00001
2017-11-27 04:00:00.000 00002
2017-11-26 03:00:00.000 00003
2017-11-25 02:00:00.000 00004
2017-11-24 01:00:00.000 00005
As you can see, type TIMESTAMP on Athena includes milliseconds.
I wrote a more comprehensive explanation on using types TIMESTAMP and DATE with OpenCSVSerDe. You can read it here.
Upvotes: 1
Reputation: 1076
It is not possible to convert the data while table creation. But you can get the data while querying. You can use date_parse(string,format) -> timestamp function. More details are mentioned here.
For your usecase you can do something like as follows
select date_parse(converteddate, '%y-%m-%dT%H:%i:%s') as converted_timestamp, userid
from test_table
Note : Based on type of your string you have to choose proper specifier for month(always two digits or not), day, hour(12 or 24 hours format), etc
Upvotes: 4