Reputation: 405
I have a large number of files (>100k). Each file contains records separated by \n; the fields within a record are delimited by ctrl+M (^M), and the data within each field is pipe-delimited. Pig treats ctrl+M as a line separator when I load with PigStorage(), and TextLoader() shows the same behavior. Any suggestions on how to handle this in Pig? Preprocessing the files may not be feasible in this case.
Sample data:
abc|^~\&|1100|7G^M0|1|2|3|4|5^Mpqr|^^^00|82|L
Final output (1 row, ^M as the delimiter):
((abc,^~\&,1100,7G),(0,1,2,3,4,5),(pqr,^^^00,82,L))
Upvotes: 0
Views: 793
Reputation: 4724
UPDATE:
Input:
abc|^~\&|1100|7G^M0|1|2|3|4|5^Mpqr|^^^00|82|L
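Pig script:
-- Load each line as a single chararray.
A = LOAD 'input' AS (line:chararray);
-- Split on the literal two-character sequence ^M (caret followed by M); assumes exactly three segments per line.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\^M')) AS (col1:chararray,col2:chararray,col3:chararray);
-- Turn the three segments into three separate rows.
C = FOREACH B GENERATE FLATTEN(TOBAG(col1,col2,col3));
-- Split each segment on the pipe character.
D = FOREACH C GENERATE FLATTEN(STRSPLIT($0,'\\|'));
DUMP D;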
Output:
(abc,^~\&,1100,7G)
(0,1,2,3,4,5)
(pqr,^^^00,82,L)
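The script above emits three rows, whereas the question asks for one row holding three nested tuples. A minimal sketch of one way to keep everything in a single row follows. It makes the same assumptions as the script above: the input contains the literal characters ^M, and every line has exactly three ^M-separated segments. The alias names are illustrative only.

```pig
-- Load each line as a single chararray.
A = LOAD 'input' AS (line:chararray);
-- Split on the literal two-character sequence ^M; assumes exactly three segments per line.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\^M')) AS (col1:chararray,col2:chararray,col3:chararray);
-- STRSPLIT returns a tuple, so without FLATTEN each segment stays a nested tuple in the same row.
C = FOREACH B GENERATE STRSPLIT(col1,'\\|'), STRSPLIT(col2,'\\|'), STRSPLIT(col3,'\\|');
DUMP C;
```

Because the per-segment STRSPLIT results are not flattened, each input line should produce one output row with three tuple-valued fields rather than three separate rows.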
Upvotes: 0