Reputation: 61
How do we handle a data in Hive when the \t
is in the value and the delimiter is also \t
. Suppose for example there is a column as Street
, data type as String
and value as XXX\tYYY
and while creating a table we have used the field delimiter as \t
. How will the delimiter work? In this case will the \t
in the value will also be delimited?
Upvotes: 0
Views: 10651
Reputation: 188
We have faced the same in our data load into hadoop clusters. What we did, added \\t
whenever we saw the delimiter is included within a data fields and added the below in the table definition.
Row format delimited fields terminated by \t
escaped by \\
lines terminated by \n
Upvotes: 0
Reputation: 6855
If your columns with \t
values are enclosed by quote character like "
the you could use csv-serde to parse the data like this:
Here is a sample dataset that I have loaded:
R1Col1 R1Col2 "R1Col3 MoreData" R1Col4
R2Col2 R2Col2 "R2Col3 MoreData" R2Col4
Register the jar from hive console
hive> add jar /path/to/csv-serde-1.1.2-0.11.0-all.jar;
Create a table with the specified serde and custom properties
hive> create table test_table(c1 string, c2 string, c3 string, c4 string)
> row format serde 'com.bizo.hive.serde.csv.CSVSerde'
> with serdeproperties(
> "separatorChar" = "\t",
> "quoteChar" = "\"",
> "escapeChar" = "\\"
> )
> stored as textfile;
Load your dataset into the table:
hive> load data inpath '/path/to/file/in/hdfs' into table test_table;
Do a select * from test_table
to check the results
You could download the csv-serde from here.
Upvotes: 2
Reputation: 691
It will treat it as a delimiter, yes, same as if you had a semicolon ; in the value and told it to split on semicolon - when the text is scanned, it will see the character and interpret it as the edge of the field.
To get around this, I used sed to find-and-replace characters before loading it into Hive, or I created the Hive table with different delimiters, or left it at the default ^A, or \001, and then, when I extracted it, used sed on the output to replace the \001 with commas or tabs or whatever I needed. Running sed -i 's/oldval/newval/g' file on the command line will replace the characters in your file in place.
Is there a reason you chose to make the table with \t as the delimiter, instead of the default Hive field delimiter of ^A? Since tab is a fairly common character in text, and Hadoop/Hive is used a lot for handling text, it is tough to find a good character for delimiting.
Upvotes: 1