Reputation: 875
I am trying to create an external table using Hive. Below is the Hive query I ran:
create external table trips_raw
(
VendorID int,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location '/user/taxi_trips/';
When I looked at the output from the 'trips_raw' table created by the query above, I saw that both the 'tpep_pickup_date_time' and 'tpep_dropoff_datetime' columns are 'NULL' in all rows. I have seen other threads talked about the reason being that the '1/1/2018 11:13:00 AM' timestamp format is not accepted by Hive, but problem is that's the timestamp format I have in my csv source data (as you can see from screenshot here).
I could specify those 2 timestamp columns as 'string' and Hive will be able to render them correctly, but I still would want those 2 columns to be 'timestamp' type so specifying those 2 columns as 'string' type is not a viable option here.
I had also tried the following technique using recommendation from this site (https://community.hortonworks.com/questions/55266/hive-date-time-problem.html) but had no success:
Create the 'trips_raw' table using 'string' as type for the 2 timestamp columns. This allows the resulting table to render the timestamps correctly, albeit in 'string' type. The Hive command I used is shown below:
create external table trips_raw
(
VendorID int,
tpep_pickup_datetime string,
tpep_dropoff_datetime string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location
'/user/taxi_trips/';
When I look at the resulting table, the dates are shown as string as you can see from this screenshot below.
But as I had mentioned earlier, I want the time columns to be in timestamp type and not string type. Therefore in the next 2 steps I tried to create a blank table and then insert the data from the table created from Step 1 but converting the string to timestamp this time.
Create an external blank table called 'trips_not_raw' using the following Hive commands:
create external table trips_not_raw
(VendorID int,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp
);
Insert data from 'trips_raw' table (which was mentioned earlier in this question), using the Hive commands below:
insert into table trips_not_raw select vendorid,
from_unixtime(unix_timestamp(tpep_pickup_datetime, 'MM/dd/yyyy HH:mm:ss
aa')) as tpep_pickup_datetime,
from_unixtime(unix_timestamp(tpep_dropoff_datetime, 'MM/dd/yyyy HH:mm:ss
aa')) as tpep_dropoff_datetime
from trips_raw;
Doing this inserts the rows into the blank table 'trips_not_raw', but the results from the 2 timestamp columns still showed as 'Null' as you can see from the screenshot below:
Is there a simple way to store the 2 time columns as 'timestamp' type and not 'string', but still be able to render them correctly in the output without seeing 'Null/None'?
Upvotes: 0
Views: 2116
Reputation: 1772
I'm afraid you need to parse timestamp column and then cast string as timestamp. For example,
select cast(regexp_replace('1/1/2018 11:13:00 AM', '(\\d{1,2})/(\\d{1,2})/(\\d{4})\\s(\\d{2}:\\d{2}:\\d{2}) \\w{2}', '$3-$1-$2 $4') as timestamp)
You can create and use a macro function for convenience, e.g.,
create temporary macro parse_date (ts string)
cast(regexp_replace(ts, '(\\d{1,2})/(\\d{1,2})/(\\d{4})\\s(\\d{2}:\\d{2}:\\d{2}) \\w{2}', '$3-$1-$2 $4') as timestamp);
then use it as follows
select parse_date('1/1/2018 11:13:00 AM');
Upvotes: 1