Alex Woolford
Alex Woolford

Reputation: 4563

Create a Hive table from datasource with caret delimited, quoted columns and nulls encoded as '\N'

I have a large set of gzip files that need to be loaded into Hive. The columns are strings, encapsulated in double quotes, and delimited by carets (^). There are some null values in the dataset that are encoded as \N, e.g.

"Doug Cutting"^"Hadoop"^"United States"
"Hadley Wickham"^"R"^"New Zealand"
"Alex Woolford"^\N^"United Kingdom"

The dataset, to my eyes, looks like a CSV (or "^SV"), and so I created a table using the OpenCSVSerde:

CREATE TABLE `technologists`(
  `name` string, 
  `famous_for` string,
  `country_of_birth` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'quoteChar'='\"', 
  'separatorChar'='^') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/some/hdfs/location'

This worked well except for the null values, which show up as an 'N', e.g.

hive> select * from technologists;
OK
Doug Cutting    Hadoop  United States
Hadley Wickham  R       New Zealand
Alex Woolford   N       United Kingdom

Do you know if there's a simple way to create this table without writing a custom SerDe or editing the files? Can the RegexSerDe replace a \N with a real null?

Upvotes: 1

Views: 459

Answers (1)

Legato
Legato

Reputation: 1081

Looks like this serde uses a backslash as the default escape character, and therefore \N is stripped into N. Add 'escapeChar' to your serde properties and set it to something other than backslash. I'd try to set it to the same as the quoteChar value (double quote) if possible. If the serde complains that it isn't allowed then maybe use some non printable character instead.

Upvotes: 1

Related Questions