OneUser
OneUser

Reputation: 187

replace comma(,) only if its inside quotes("") in Pig

I have data like this:

1,234,"john, lee", [email protected]

I want to remove , inside "" with space using pig script. So that my data will look like:

1,234,john lee, [email protected]

I tried using CSVExcelStorage to load this data but i need to use '-tagFile' option as well which is not supported in CSVExcelStorage . So i am planning to use PigStorage only and then replace any comma (,) inside quotes. I am stuck on this. Any help is highly appreciated. Thanks

Upvotes: 0

Views: 1129

Answers (3)

OneUser
OneUser

Reputation: 187

I got the perfect way to do this. A very generic solution is as below:

data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);

/*replace comma(,) if it appears in column content*/
replaceComma = FOREACH data GENERATE filename, REPLACE (record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '');

/*replace the quotes("") which is present around the column if it have comma(,) as its a csv file feature*/
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE ($4,'"','') as record;

Detailed use case is available at my blog

Upvotes: 0

nobody
nobody

Reputation: 11080

Load it into a single field and then use STRSPLIT and REPLACE

A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE STRSPLIT(line,'\\"',3); 
C = FOREACH B GENERATE REPLACE($1,',','');
D = FOREACH C GENERATE CONCAT(CONCAT($0,$1),$2); -- You can further use STRSPLIT to get individual fields or just CONCAT
E = FOREACH D GENERATE STRSPLIT(D.$0,',',4);
DUMP E;

A

1,234,"john, lee", [email protected]

B

(1,234,)(john, lee)(, [email protected])

C

(1,234,)(john lee)(, [email protected])

D

(1,234,john lee, [email protected])

E

(1),(234),(john lee),([email protected])

Upvotes: 0

Taha Naqvi
Taha Naqvi

Reputation: 1766

Below command will help:

csvFile = load '/path/to/file' using PigStorage(',');
result = foreach csvFile generate $0 as (field1:chararray),$1 as (field2:chararray),CONCAT(REPLACE($2, '\\"', '') , REPLACE($3, '\\"', '')) as field3,$4 as (field4:chararray);

Ouput:

(1,234,john lee, [email protected])

Upvotes: 1

Related Questions