Reputation: 187
I have data like this:
1,234,"john, lee", [email protected]
I want to remove , inside "" with space using pig script. So that my data will look like:
1,234,john lee, [email protected]
I tried using CSVExcelStorage to load this data but i need to use '-tagFile' option as well which is not supported in CSVExcelStorage . So i am planning to use PigStorage only and then replace any comma (,) inside quotes. I am stuck on this. Any help is highly appreciated. Thanks
Upvotes: 0
Views: 1129
Reputation: 187
I got the perfect way to do this. A very generic solution is as below:
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
/*replace comma(,) if it appears in column content*/
replaceComma = FOREACH data GENERATE filename, REPLACE (record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '');
/*replace the quotes("") which is present around the column if it have comma(,) as its a csv file feature*/
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE ($4,'"','') as record;
Detailed use case is available at my blog
Upvotes: 0
Reputation: 11080
Load it into a single field and then use STRSPLIT and REPLACE
A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE STRSPLIT(line,'\\"',3);
C = FOREACH B GENERATE REPLACE($1,',','');
D = FOREACH C GENERATE CONCAT(CONCAT($0,$1),$2); -- You can further use STRSPLIT to get individual fields or just CONCAT
E = FOREACH D GENERATE STRSPLIT(D.$0,',',4);
DUMP E;
A
1,234,"john, lee", [email protected]
B
(1,234,)(john, lee)(, [email protected])
C
(1,234,)(john lee)(, [email protected])
D
(1,234,john lee, [email protected])
E
(1),(234),(john lee),([email protected])
Upvotes: 0
Reputation: 1766
Below command will help:
csvFile = load '/path/to/file' using PigStorage(',');
result = foreach csvFile generate $0 as (field1:chararray),$1 as (field2:chararray),CONCAT(REPLACE($2, '\\"', '') , REPLACE($3, '\\"', '')) as field3,$4 as (field4:chararray);
Ouput:
(1,234,john lee, [email protected])
Upvotes: 1