Create a Hive table from datasource with caret delimited, quoted columns and nulls encoded as '\N'

Question

I have a large set of gzip files that need to be loaded into Hive. The columns are strings, encapsulated in double quotes, and delimited by carets (^). There are some null values in the dataset that are encoded as \N, e.g.

"Doug Cutting"^"Hadoop"^"United States"
"Hadley Wickham"^"R"^"New Zealand"
"Alex Woolford"^\N^"United Kingdom"

The dataset, to my eyes, looks like a CSV (or "^SV"), and so I created a table using the OpenCSVSerde:

CREATE TABLE `technologists`(
  `name` string, 
  `famous_for` string,
  `country_of_birth` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'quoteChar'='\"', 
  'separatorChar'='^') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/some/hdfs/location'

This worked well except for the null values, which show up as an 'N', e.g.

hive> select * from technologists;
OK
Doug Cutting    Hadoop  United States
Hadley Wickham  R       New Zealand
Alex Woolford   N       United Kingdom

Do you know if there's a simple way to create this table without writing a custom SerDe or editing the files? Can the RegexSerDe replace a \N with a real null?

Create a Hive table from datasource with caret delimited, quoted columns and nulls encoded as '\N'

Answers (1)

Related Questions

Create a Hive table from datasource with caret delimited, quoted columns and nulls encoded as &#39;\N&#39;

Answers (1)

Related Questions

Create a Hive table from datasource with caret delimited, quoted columns and nulls encoded as '\N'