Reputation: 323
I have a CSV file that is delimited by double quotes and a comma comma. It looks a bit like the following:
"Text from vendor ","Vendor Name, Inc."," blah blah ","Next string","", 1234
I am trying to import it into table in Hbase using Hive.
I am able to create a table from Hive in Hbase using something like the following:
hive> CREATE TABLE exampletable1(tax_numb int, tax_name string, tax_addr string, tax_city string, tax_stat string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:tax_name,cf:tax_addr,cf:tax_city,cf:tax_stat")
TBLPROPERTIES ("hbase.table.name" = "hiveexampletable1");
I was also able to add a table to Hive where I imported the CSV file (although with a problem with the double quotes) using a command like:
hive> create table example2(tax_numb int, tax_name string, tax_addr string, tax_city string, tax_stat string) row format delimited fields terminated by ',' stored as textfile;
I am however, unable to be able to integrate the "delimited fields" bit into the command where I create an external table in Hbase.
As I'm relatively new to the Hive-Hbase combination I went through a number of tutorials before arriving to a point where I deemed it better to ask the community for help. I ran a number of searches on google, to no avail.
Any help / suggestions would be much appreciated.
Upvotes: 3
Views: 5890
Reputation: 323
I use org.apache.hadoop.hive.serde2.RegexSerDe as serde to parse this kind of files. For example if I have a csv with 4 fields which are integer, string, integer, string I could use:
CREATE EXTERNAL TABLE mytable (
F1 BIGINT, F2 STRING,
F2 INT, F4 STRING
)
row format SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
with SERDEPROPERTIES ("input.regex" = "^\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\"$")
LOCATION "/somepath";
The regexp y always the same, so I generate it with python like this:
>>> r = "^" + ",".join(['"([^"]*)"' for i in xrange(0,4)]) + "$"
>>> print str(r).replace('"', '\\"')
^\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\"$
Greetings!
Upvotes: 1
Reputation: 323
After several attempts at finding a beautiful solution I had to finally resort to going old-school and fall back on awk. The command I used looks kinda like this:
$ cat inputCSVfile.csv | awk '{print substr ($0, 252, 20) echo "|" substr ($0, 133, 2) echo "|" substr ($0, 297, 13)}'
This gave me what I needed. Then I imported the data into Hive and popped it into Hbase from there. Hope it helps someone in the future.
Upvotes: 0