user1734980
user1734980

Reputation: 45

How to remove ^A and \N from file using hive

I get the data from a staging table.I am selecting all columns data from staging table and inserting into a base table.After inserting to base table file looks like below.

val1^Aval2^Aval3^A\N^Aval4^A\N But i need data to be like this.

val1 val2 val3   val4  

^A must be removed from file and \N should be replaced by blankspace. I want to achieve this in hive any help is appreciate.

Upvotes: 0

Views: 3936

Answers (4)

Inquisitive
Inquisitive

Reputation: 1

The below works

CREATE TABLE A
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ('serialization.null.format' = '')
AS
SELECT * FROM B;

Upvotes: 0

Abhijeet Dhumal
Abhijeet Dhumal

Reputation: 1809

The output of hive is written using a field delimiter, by default hive uses CTRL-A delimiter i.e. (^A). So here you are seeing the output file with ^A characters in between it.

By default NULL values are written in the data files as \N and \N in the data files are being interpreted as NULL when querying the data.

So if you don't want \N in the output data file, you can set the default values for those data types using COALESCE function in hive. COALESCE function returns the value if it is not null and if it is null it returns the default value specified.

Approach 1: Create a base table with field delimiter '\t' i.e. tab. Override the property 'serialization.null.format'='' in create table statement to set the default value for empty string instead of \N.

CREATE TABLE base_table (
column_1 BIGINT,
column_2 STRING
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE
LOCATION '/tmp/base_table'
TBLPROPERTIES('serialization.null.format'='');

Then insert the data in to table using below query:

INSERT OVERWRITE TABLE base_table
SELECT COALESCE(column_1, 0L), COALESCE(column_2,"")
FROM my_table
[ some WHERE clause here]

The output file you will see will be with tab separated and \N values will be set to field type default.

Approach 2:

The first approach will be the preferred one. In this approach use the insert overwrite directory on the base table and write the output to the directory want:

INSERT OVERWRITE LOCAL DIRECTORY '/home/user/my_table_output/'
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY  '\t' ESCAPED BY '"' LINES TERMINATED BY '\n' 
STORED AS TEXTFILE
SELECT COALESCE(column_1, 0L), COALESCE(column_2,"") 
FROM base_table;

In the output directory, you will see the output in expected format.

Approach 3:

From your explanation, it looks like you are trying to read this output file and I am assuming that you are reading it through mapreduce code.

You can use the delimiter '\u0001' (use with escape character while coding i.e.'\u0001') to split the input string.

And while processing each line/field you can check for '\N' string occurrence and replace it with any default value.

Upvotes: 3

nobody
nobody

Reputation: 11080

For \N

replace(column_name, "\\\N", " ")

For ^A

replace(column_name, "\\^A", "")

Upvotes: 0

Andrew
Andrew

Reputation: 8703

You can use regexp_replace for this. For example, to replace your ^A:

regexp_replace(<your column>,'\\^A',' ')

Upvotes: 0

Related Questions