Reputation: 5706
I created table using following command -
create table cust(event int, pid int, REQ_FROM_IP int, REQ_CITY_ID int, REQ_STATE_ID int, REQ_COUNTRY_ID int, key String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "|",
"quoteChar" = "'",
"escapeChar" = "\\"
)STORED AS TEXTFILE;
However the created table has all coloumns as String!
desc cust;
OK
event string from deserializer
pid string from deserializer
req_from_ip string from deserializer
req_city_id string from deserializer
req_state_id string from deserializer
req_country_id string from deserializer
key string from deserializer
Is it the SerDe messing with the datastructure?
Upvotes: 2
Views: 6425
Reputation: 5236
From perusing the source code, it looks like the OpenCSVSerde will always output string columns without regard for what types were actually specified in the HiveQL query.
A potential (not very pretty) workaround to this could be to use the OpenCSVSerde to format a staging table (the definition can be identical to what you currently have for cust
, maybe call it cust_staging
. If your use case is that you're trying to load data in that format and trying to work with it in Hive with the "correct" data types, you could populate cust_staging
as you would normally have loaded the table and then populate your target table cust
with a CTAS statement like:
CREATE TABLE cust AS SELECT
CAST(event AS INT) AS event,
CAST(pid AS INT) AS pid,
CAST(REQ_FROM_IP AS INT) AS REQ_FROM_IP,
CAST(REQ_CITY_ID AS INT) AS REQ_CITY_ID,
CAST(REQ_STATE_ID AS INT) AS REQ_STATE_ID,
CAST(REQ_COUNTRY_ID AS INT) AS REQ_COUNTRY_ID,
key
FROM cust_staging;
And now cust
will have the column types you were expecting:
hive> DESCRIBE cust;
OK
event int
pid int
req_from_ip int
req_city_id int
req_state_id int
req_country_id int
key string
Time taken: 0.546 seconds, Fetched: 7 row(s)
Per the excellent input from @JeremyBeard, if you are okay with never materializing the data with it's "correct" types, cust
can actually be a view:
CREATE VIEW `cust` -- All the other stuff is the same
To further drill open this line of reasoning, if your use-case is entirely read-only (you never have to actually modify this data), you could define cust_staging
as an external table (which still uses the OpenCSVSerde) that points to your pipe-separated data files, and then define cust
as a view into this external table.
However, if your use case is that you need store a table with non-string columns in the OpenCSVSerde format, then you can still create the cust_staging
table (again, with the same definition you originally had) and populate data in the reverse direction:
INSERT INTO TABLE cust_staging SELECT * FROM cust;
And if all of that still feels not quite elegant enough (understandable) - the solution may be to extend the OpenCSVSerde into a custom SerDe that will do what you want it to do without the need for intermediary staging tables.
Upvotes: 4