Vinay Ram
Vinay Ram

Reputation: 61

How to handle a delimiter in Hive

How do we handle a data in Hive when the \t is in the value and the delimiter is also \t. Suppose for example there is a column as Street, data type as String and value as XXX\tYYY and while creating a table we have used the field delimiter as \t. How will the delimiter work? In this case will the \t in the value will also be delimited?

Upvotes: 0

Views: 10651

Answers (3)

Murari Goswami
Murari Goswami

Reputation: 188

We have faced the same in our data load into hadoop clusters. What we did, added \\t whenever we saw the delimiter is included within a data fields and added the below in the table definition.

Row format delimited fields terminated by \t escaped by \\ lines terminated by \n

Upvotes: 0

Ashrith
Ashrith

Reputation: 6855

If your columns with \t values are enclosed by quote character like " the you could use csv-serde to parse the data like this:

Here is a sample dataset that I have loaded:

R1Col1  R1Col2  "R1Col3 MoreData"   R1Col4
R2Col2  R2Col2  "R2Col3 MoreData"   R2Col4
  1. Register the jar from hive console

    hive> add jar /path/to/csv-serde-1.1.2-0.11.0-all.jar;
    
  2. Create a table with the specified serde and custom properties

    hive> create table test_table(c1 string, c2 string, c3 string, c4 string)
    > row format serde 'com.bizo.hive.serde.csv.CSVSerde'                
    > with serdeproperties(                                              
    >  "separatorChar" = "\t",
    >  "quoteChar"     = "\"",
    >  "escapeChar"    = "\\"
    > )   
    > stored as textfile;
    
  3. Load your dataset into the table:

    hive> load data inpath '/path/to/file/in/hdfs' into table test_table;
    
  4. Do a select * from test_table to check the results

You could download the csv-serde from here.

Upvotes: 2

suiterdev
suiterdev

Reputation: 691

It will treat it as a delimiter, yes, same as if you had a semicolon ; in the value and told it to split on semicolon - when the text is scanned, it will see the character and interpret it as the edge of the field.

To get around this, I used sed to find-and-replace characters before loading it into Hive, or I created the Hive table with different delimiters, or left it at the default ^A, or \001, and then, when I extracted it, used sed on the output to replace the \001 with commas or tabs or whatever I needed. Running sed -i 's/oldval/newval/g' file on the command line will replace the characters in your file in place.

Is there a reason you chose to make the table with \t as the delimiter, instead of the default Hive field delimiter of ^A? Since tab is a fairly common character in text, and Hadoop/Hive is used a lot for handling text, it is tough to find a good character for delimiting.

Upvotes: 1

Related Questions