Reputation: 4563
I have a large set of gzip files that need to be loaded into Hive. The columns are strings, encapsulated in double quotes, and delimited by carets (^
). There are some null
values in the dataset that are encoded as \N
, e.g.
"Doug Cutting"^"Hadoop"^"United States"
"Hadley Wickham"^"R"^"New Zealand"
"Alex Woolford"^\N^"United Kingdom"
The dataset, to my eyes, looks like a CSV (or "^SV"), and so I created a table using the OpenCSVSerde:
CREATE TABLE `technologists`(
`name` string,
`famous_for` string,
`country_of_birth` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'='^')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/some/hdfs/location'
This worked well except for the null
values, which show up as an 'N', e.g.
hive> select * from technologists;
OK
Doug Cutting Hadoop United States
Hadley Wickham R New Zealand
Alex Woolford N United Kingdom
Do you know if there's a simple way to create this table without writing a custom SerDe or editing the files? Can the RegexSerDe replace a \N
with a real null
?
Upvotes: 1
Views: 459
Reputation: 1081
Looks like this serde uses a backslash as the default escape character, and therefore \N
is stripped into N
. Add 'escapeChar' to your serde properties and set it to something other than backslash. I'd try to set it to the same as the quoteChar value (double quote) if possible. If the serde complains that it isn't allowed then maybe use some non printable character instead.
Upvotes: 1