user3327034

Reputation: 405

Pig latin load ctrl+M delimited text

I have a lot of files (>100k). Each file contains records that are delimited by ctrl+M, and lines are separated by \n. Within each record, the fields are pipe-delimited. Pig treats ctrl+M as a line separator when we use PigStorage(). I tried TextLoader() and it showed the same behavior. Any suggestions on how to handle this in Pig? Preprocessing the files may not be feasible in this case.

Sample data:

abc|^~\&|1100|7G^M0|1|2|3|4|5^Mpqr|^^^00|82|L

Final output (1 row, ^M delimiter): ((abc,^~\&,1100,7G),(0,1,2,3,4,5),(pqr,^^^00,82,L))

Upvotes: 0

Views: 793

Answers (1)

Sivasakthi Jayaraman

Reputation: 4724

UPDATE:

Input:

abc|^~\&|1100|7G^M0|1|2|3|4|5^Mpqr|^^^00|82|L

PigScript:

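-- Load each line as a single chararray.
A = LOAD 'input' AS (line:chararray);
-- Split on the literal two-character sequence ^M; the schema assumes exactly three records per line.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\^M')) AS (col1:chararray,col2:chararray,col3:chararray);
-- Turn the three columns into three rows.
C = FOREACH B GENERATE FLATTEN(TOBAG(col1,col2,col3));
-- Split each record into its pipe-delimited fields.
D = FOREACH C GENERATE FLATTEN(STRSPLIT($0,'\\|'));
DUMP D;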

Output:

(abc,^~\&,1100,7G)
(0,1,2,3,4,5)
(pqr,^^^00,82,L)
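
To check the expected result outside Pig, the same two-level split can be sketched in plain Python. This is only an illustration of the logic, not part of the Pig script: the function name `split_records` and its parameters are hypothetical, and it assumes, as the script above does, that the record delimiter appears in the data as the literal characters `^M`. Pass `record_sep="\r"` if the file contains a real carriage return instead.

```python
def split_records(line, record_sep="^M", field_sep="|"):
    """Split one input line into records, then each record into fields.

    record_sep defaults to the literal two characters '^M' (an assumption
    taken from the sample data); use '\r' for an actual carriage return.
    """
    return [tuple(record.split(field_sep)) for record in line.split(record_sep)]


# Sample line from the question, as a raw string so the backslash is kept.
sample = r"abc|^~\&|1100|7G^M0|1|2|3|4|5^Mpqr|^^^00|82|L"

rows = split_records(sample)
for row in rows:
    print(row)
```

Unlike the Pig script, which names exactly three columns after the first split, this sketch accepts any number of records per line.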

Upvotes: 0
