Mangat Rai Modi
Mangat Rai Modi

Reputation: 5706

Hive table creation using OpenCsv SerDe

I created table using following command -

 create table cust(event int, pid int, REQ_FROM_IP int, REQ_CITY_ID int, REQ_STATE_ID int, REQ_COUNTRY_ID int, key String) 
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'                                                                                 
 WITH SERDEPROPERTIES (                                                                                                                        
  "separatorChar" = "|",                                                                                                                       
    "quoteChar"     = "'",                                                                                                                     
    "escapeChar"    = "\\"                                                                                                                     
 )STORED AS TEXTFILE; 

However the created table has all coloumns as String!

    desc cust;
    OK
event                   string                  from deserializer   
pid                     string                  from deserializer   
req_from_ip             string                  from deserializer   
req_city_id             string                  from deserializer   
req_state_id            string                  from deserializer   
req_country_id          string                  from deserializer   
key                     string                  from deserializer  

Is it the SerDe messing with the datastructure?

Upvotes: 2

Views: 6425

Answers (1)

rchang
rchang

Reputation: 5236

From perusing the source code, it looks like the OpenCSVSerde will always output string columns without regard for what types were actually specified in the HiveQL query.

A potential (not very pretty) workaround to this could be to use the OpenCSVSerde to format a staging table (the definition can be identical to what you currently have for cust, maybe call it cust_staging. If your use case is that you're trying to load data in that format and trying to work with it in Hive with the "correct" data types, you could populate cust_staging as you would normally have loaded the table and then populate your target table cust with a CTAS statement like:

CREATE TABLE cust AS SELECT
  CAST(event AS INT) AS event,
  CAST(pid AS INT) AS pid,
  CAST(REQ_FROM_IP AS INT) AS REQ_FROM_IP,
  CAST(REQ_CITY_ID AS INT) AS REQ_CITY_ID,
  CAST(REQ_STATE_ID AS INT) AS REQ_STATE_ID,
  CAST(REQ_COUNTRY_ID AS INT) AS REQ_COUNTRY_ID,
  key
FROM cust_staging;

And now cust will have the column types you were expecting:

hive> DESCRIBE cust;
OK
event                       int                                 
pid                         int                                 
req_from_ip                 int                                 
req_city_id                 int                                 
req_state_id                int                                 
req_country_id              int                                 
key                         string                              
Time taken: 0.546 seconds, Fetched: 7 row(s)

Per the excellent input from @JeremyBeard, if you are okay with never materializing the data with it's "correct" types, cust can actually be a view:

CREATE VIEW `cust` -- All the other stuff is the same

To further drill open this line of reasoning, if your use-case is entirely read-only (you never have to actually modify this data), you could define cust_staging as an external table (which still uses the OpenCSVSerde) that points to your pipe-separated data files, and then define cust as a view into this external table.

However, if your use case is that you need store a table with non-string columns in the OpenCSVSerde format, then you can still create the cust_staging table (again, with the same definition you originally had) and populate data in the reverse direction:

INSERT INTO TABLE cust_staging SELECT * FROM cust;

And if all of that still feels not quite elegant enough (understandable) - the solution may be to extend the OpenCSVSerde into a custom SerDe that will do what you want it to do without the need for intermediary staging tables.

Upvotes: 4

Related Questions