PPPP
PPPP

Reputation: 601

How to load CSV data with embedded double quote using CSV serde in Hive. Without updating the incoming data file

I have text file like below :

1,"TEST"Data","SAMPLE DATA"

and the table structure is like this :

CREATE TABLE test1( id string, col1 string , col2 string )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
  LOCATION 'mylocation/test1'`

When I am putting the file in concerned HDFS location. 2nd and 03 rd column are populating as null that is because of the double quote in between (TEST"Data).

One way is to update the data file using escape character "/" but we are not allowed to update the incoming data. How can I load data properly and escape these embedded double quotes.

Appreciate the help !!

Upvotes: 3

Views: 2473

Answers (1)

David דודו Markovitz
David דודו Markovitz

Reputation: 44911

You can load it using RegexSerDe

Demo

bash

mkdir test1

cat>test1/file.txt
1,"TEST"Data","SAMPLE DATA"
2,"TEST Data","SAMPLE DATA"
3,"TEST","Data","SAMPLE","DATA"

hdfs dfs -put test1 /tmp

hive

create external table test1 
( 
    id      string
   ,col1    string
   ,col2    string 
)
    row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    with serdeproperties  
    (
        'input.regex' = '^(\\d+?),"(.*)","(.*)"$'
    )
    location '/tmp/test1'
;

select * from test1
;

+----------+----------------------+-------------+
| test1.id |      test1.col1      | test1.col2  |
+----------+----------------------+-------------+
| 1        | TEST"Data            | SAMPLE DATA |
| 2        | TEST Data            | SAMPLE DATA |
| 3        | TEST","Data","SAMPLE | DATA        |
+----------+----------------------+-------------+

Upvotes: 2

Related Questions