Reputation: 323

Import CSV file delimited by double quotes and comma into Hbase using Hive

I have a CSV file that is delimited by double quotes and a comma comma. It looks a bit like the following:

"Text from vendor ","Vendor Name, Inc."," blah blah ","Next string","", 1234

I am trying to import it into table in Hbase using Hive.

I am able to create a table from Hive in Hbase using something like the following:

hive> CREATE TABLE exampletable1(tax_numb int, tax_name string, tax_addr string, tax_city string, tax_stat string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:tax_name,cf:tax_addr,cf:tax_city,cf:tax_stat")
TBLPROPERTIES ("hbase.table.name" = "hiveexampletable1");

I was also able to add a table to Hive where I imported the CSV file (although with a problem with the double quotes) using a command like:

hive> create table example2(tax_numb int, tax_name string, tax_addr string, tax_city string, tax_stat string) row format delimited fields terminated by ',' stored as textfile;

I am however, unable to be able to integrate the "delimited fields" bit into the command where I create an external table in Hbase.

As I'm relatively new to the Hive-Hbase combination I went through a number of tutorials before arriving to a point where I deemed it better to ask the community for help. I ran a number of searches on google, to no avail.

Any help / suggestions would be much appreciated.

Upvotes: 3

Answers (2)

juanrh0011

Reputation: 323

I use org.apache.hadoop.hive.serde2.RegexSerDe as serde to parse this kind of files. For example if I have a csv with 4 fields which are integer, string, integer, string I could use:

CREATE EXTERNAL TABLE mytable (
    F1 BIGINT, F2 STRING, 
    F2 INT, F4 STRING
)
row format SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
with SERDEPROPERTIES ("input.regex" = "^\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\"$")
LOCATION "/somepath";

The regexp y always the same, so I generate it with python like this:

>>> r = "^" + ",".join(['"([^"]*)"' for i in xrange(0,4)]) + "$"
>>> print str(r).replace('"', '\\"')
^\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\"$

Greetings!

Upvotes: 1

Soucrit

Reputation: 323

After several attempts at finding a beautiful solution I had to finally resort to going old-school and fall back on awk. The command I used looks kinda like this:

$ cat inputCSVfile.csv | awk '{print substr ($0, 252, 20) echo "|" substr ($0, 133, 2) echo "|" substr ($0, 297, 13)}'

This gave me what I needed. Then I imported the data into Hive and popped it into Hbase from there. Hope it helps someone in the future.

Upvotes: 0

Import CSV file delimited by double quotes and comma into Hbase using Hive

Answers (2)

Related Questions