Reputation: 61

How to handle a delimiter in Hive

How do we handle a data in Hive when the \t is in the value and the delimiter is also \t. Suppose for example there is a column as Street, data type as String and value as XXX\tYYY and while creating a table we have used the field delimiter as \t. How will the delimiter work? In this case will the \t in the value will also be delimited?

Upvotes: 0

Answers (3)

Murari Goswami

Reputation: 188

We have faced the same in our data load into hadoop clusters. What we did, added \\t whenever we saw the delimiter is included within a data fields and added the below in the table definition.

Row format delimited fields terminated by \t escaped by \\ lines terminated by \n

Upvotes: 0

Ashrith

Reputation: 6855

If your columns with \t values are enclosed by quote character like " the you could use csv-serde to parse the data like this:

Here is a sample dataset that I have loaded:

R1Col1  R1Col2  "R1Col3 MoreData"   R1Col4
R2Col2  R2Col2  "R2Col3 MoreData"   R2Col4

hive> add jar /path/to/csv-serde-1.1.2-0.11.0-all.jar;

Create a table with the specified serde and custom properties

hive> create table test_table(c1 string, c2 string, c3 string, c4 string)
> row format serde 'com.bizo.hive.serde.csv.CSVSerde'                
> with serdeproperties(                                              
>  "separatorChar" = "\t",
>  "quoteChar"     = "\"",
>  "escapeChar"    = "\\"
> )   
> stored as textfile;

Load your dataset into the table:

hive> load data inpath '/path/to/file/in/hdfs' into table test_table;

Do a select * from test_table to check the results

You could download the csv-serde from here.

Upvotes: 2

suiterdev

Reputation: 691

It will treat it as a delimiter, yes, same as if you had a semicolon ; in the value and told it to split on semicolon - when the text is scanned, it will see the character and interpret it as the edge of the field.

To get around this, I used sed to find-and-replace characters before loading it into Hive, or I created the Hive table with different delimiters, or left it at the default ^A, or \001, and then, when I extracted it, used sed on the output to replace the \001 with commas or tabs or whatever I needed. Running sed -i 's/oldval/newval/g' file on the command line will replace the characters in your file in place.

Is there a reason you chose to make the table with \t as the delimiter, instead of the default Hive field delimiter of ^A? Since tab is a fairly common character in text, and Hadoop/Hive is used a lot for handling text, it is tough to find a good character for delimiting.

Upvotes: 1

How to handle a delimiter in Hive

Answers (3)

Related Questions