ashwini

Reputation: 531

How to load data in Pig with a delimiter made of more than one control character

I am loading a file in Pig whose fields are separated by the delimiter '^A^E^A'.

I tried the command below, but it does not work.

data = LOAD 'test.txt' USING PigStorage('\u0001\u0005\u0001') AS (user, time, query);

Did I miss anything? Is there a way to specify this delimiter directly using PigStorage, and if so, how?

Thanks.

Upvotes: 1

Views: 419

Answers (3)

Sreeni

Reputation: 1

File_Data = LOAD 'thedata.csv' USING TextLoader();
-- Replace each Ctrl-A Ctrl-E Ctrl-A sequence with a comma (assumes no field contains a comma).
Cleansing_Data = FOREACH File_Data GENERATE REPLACE($0, '\\u0001\\u0005\\u0001', ',');
STORE Cleansing_Data INTO 'tmp/Cleansing_Data.txt' USING PigStorage();
Final_Data = LOAD 'tmp/Cleansing_Data.txt' USING PigStorage(',') AS (user, time, query);

Upvotes: 0

nobody

Reputation: 11080

Load the data as line:chararray

Replace '\u0001\u0005\u0001' with a '|' or ','

Split the resulting line using the '|' or ',' to generate the required columns.

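data = LOAD 'test.txt' AS (line:chararray);
clean_data = FOREACH data GENERATE REPLACE(line, '\\u0001\\u0005\\u0001', '|') AS line;
-- STRSPLIT takes a regex, so the pipe must be escaped; FLATTEN turns the resulting tuple into columns.
new_data = FOREACH clean_data GENERATE FLATTEN(STRSPLIT(line, '\\|', 3)) AS (user:chararray, time:chararray, query:chararray);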

Upvotes: 2

Arunakiran Nulu

Reputation: 2089

I believe PigStorage does not support a delimiter made of more than one control character; you may have to write a UDF to achieve this.
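
As a rough idea of what such a UDF would do, here is a minimal Python sketch of the splitting logic only. The function name split_record, the DELIMITER constant, and the padding behaviour are my own assumptions for illustration; the three-field layout comes from the (user, time, query) schema in the question. Registering the function with Pig and declaring its output schema are not shown.

```python
# Hypothetical helper, not part of Pig: splits one raw line on the
# three-character delimiter Ctrl-A, Ctrl-E, Ctrl-A.
DELIMITER = "\x01\x05\x01"


def split_record(line, expected_fields=3):
    """Return a tuple of expected_fields values taken from one input line."""
    if line is None:
        return None
    # Drop the trailing newline, then split on the full delimiter sequence.
    parts = line.rstrip("\r\n").split(DELIMITER)
    # Pad short rows with None so every record has the same width.
    parts += [None] * (expected_fields - len(parts))
    # Extra fields beyond expected_fields are dropped (an assumption of this sketch).
    return tuple(parts[:expected_fields])
```

Each input line would be passed to the function as a single chararray, for example one loaded with TextLoader(), and the returned tuple would then be flattened into the user, time and query columns.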

Upvotes: 0
