Srini
Srini

Reputation: 23

Hive - Load pipe delimited data starting with pipe

Let's say I want to create a simple table with 4 columns in Hive and load some pipe-delimited data starting with pipe.

CREATE table TEST_1 (
COL1  string,
COL2  string,
COL3  string,
COL4  string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

Raw Data:

|123|456|Dasani Bottled|5|,   
|124|455|Test Bottled |7|      

Table getting date like

----------------------------------
COL1  COL2   COL3     COL4
----------------------------------
      123   456     Dasani Bottled
      123   455     Test Bottled
----------------------------------

COL1 Getting Empty and last column not loaded. I have try to load the csv file using Hadoop put command.

Help me to resolve this issue.

Upvotes: 2

Views: 1215

Answers (2)

leftjoin
leftjoin

Reputation: 38290

If fixing data files is not an option, you can use RegexSerDe instead of LasySimpleSerDe (default SerDe for text files).

Define how your data looks in the regex. Each column should have corresponding capturing group () in the regex.

To test how your regex works before creating the table, use regex_replace

select regexp_replace('|123|456|Dasani Bottled|5|, ', --your row example
                      '^\\|(.*?)\\|(.*?)\\|(.*?)\\|(.*?)\\|.*', --4 groups are in the regex
                     '$1 $2 $3 $4'); --output space delimited fields 

Result:

123 456 Dasani Bottled 5

If the regex works as expected, create table (not necessarily external):

create external table TEST_1 (
COL1  string,
COL2  string,
COL3  string,
COL4  string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('input.regex'='^\\|(.*?)\\|(.*?)\\|(.*?)\\|(.*?)\\|.*')
location ....
;

Then copy files into table location or use LOAD command.

Upvotes: 2

Koushik Roy
Koushik Roy

Reputation: 7387

You can solve this by 2 ways.

  1. Remove first column before processing the file. This is clean and preferable solution.
cut -d "|" -f 2- input_filename > output_filename

Then use this output_filename as your input to the load process.

-d "|" - this says, use pipe as a delimiter. -f 2- - this says, extract everything after first field.

  1. add a dummy column in the beginning of the table like this
CREATE table TEST_1 (
dummy string,
COL1  string,
COL2  string,
COL3  string,
COL4  string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

And then proceed with the loading data. Then you can ignore this dummy column or store the data into a final table without this column or create a view on top of this to exclude this dummy column.

Upvotes: 1

Related Questions