Reputation: 601
I have a text file like below:
1,"TEST"Data","SAMPLE DATA"
and the table structure is like this:
CREATE TABLE test1( id string, col1 string , col2 string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'mylocation/test1'
When I put the file in the corresponding HDFS location, the 2nd and 3rd columns are populated as null.
That is because of the embedded double quote (TEST"Data).
One way would be to update the data file with an escape character ("\"), but we are not allowed to modify the incoming data. How can I load the data properly and handle these embedded double quotes?
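For reference, OpenCSVSerde does let the separator, quote, and escape characters be set explicitly through SERDEPROPERTIES (the values below are just its defaults spelled out), but because the quote inside TEST"Data is not escaped in the incoming file, these settings alone still leave the parser unable to tell the inner quote from a field boundary:
CREATE TABLE test1 (id string, col1 string, col2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
STORED AS TEXTFILE
LOCATION 'mylocation/test1';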
Appreciate the help!
Upvotes: 3
Views: 2473
Reputation: 44911
You can load it using RegexSerDe:
bash
mkdir test1
cat>test1/file.txt
1,"TEST"Data","SAMPLE DATA"
2,"TEST Data","SAMPLE DATA"
3,"TEST","Data","SAMPLE","DATA"
hdfs dfs -put test1 /tmp
hive
create external table test1
(
id string
,col1 string
,col2 string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties
(
'input.regex' = '^(\\d+?),"(.*)","(.*)"$'
)
location '/tmp/test1'
;
select * from test1
;
+----------+----------------------+-------------+
| test1.id | test1.col1 | test1.col2 |
+----------+----------------------+-------------+
| 1 | TEST"Data | SAMPLE DATA |
| 2 | TEST Data | SAMPLE DATA |
| 3 | TEST","Data","SAMPLE | DATA |
+----------+----------------------+-------------+
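Both (.*) groups in the regex are greedy, so the first one grabs everything up to the last "," on the line; that is why row 3 keeps the extra quoted pieces in col1 and only the final DATA ends up in col2. If you want a quick sanity check of the same pattern outside Hive, a rough bash equivalent of what the SerDe does per line (POSIX ERE has no \d, so it becomes [0-9]):
# rough sanity check of the same pattern against the test file
sed -E 's/^([0-9]+),"(.*)","(.*)"$/id=\1 | col1=\2 | col2=\3/' test1/file.txt
It should print the same three-way split for each line as the table above.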
Upvotes: 2