Reputation: 1430
How do you process a flat file with Pig? For example, if you had a line containing a record where the first four positions were the year, the next 5 were a product code, and the last 8 contained the MSRP, how would you query this data with Pig? I'm probably missing something simple, but everything I've found thus far requires a delimiter to be used when loading data with Pig.
Some sample data is provided below:
1999ABCDE12234.00
2000DCEFS00020.00
2012FFEWS00005.55
Thanks in advance.
Jeremy
Upvotes: 2
Views: 647
Reputation: 592
Both previous answers are great. You can also implement your own UDF if the input string is complicated or requires conditional parsing.
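As a rough illustration of the UDF route, here is a minimal sketch of a Jython UDF, assuming the 4/5/8 column layout from the question. The file name `fixed_width.py`, the function name `parse_fixed`, and the choice to return null for malformed records are all assumptions made for the example, not something from the original posts.

```python
# fixed_width.py: a hypothetical Jython UDF module for Pig.

try:
    outputSchema  # Pig provides this decorator when the file is registered USING jython
except NameError:
    # Fallback so the module can also be run and tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("rec:tuple(year:int, prod_code:chararray, msrp:double)")
def parse_fixed(line):
    """Split a 17-character record into (year, prod_code, msrp)."""
    # Conditional handling (an assumption): missing or short records become null.
    if line is None or len(line) < 17:
        return None
    try:
        year = int(line[0:4])      # positions 1-4: year
        msrp = float(line[9:17])   # positions 10-17: MSRP
    except ValueError:
        # Non-numeric year or price: treat the record as unparseable.
        return None
    return (year, line[4:9], msrp)  # positions 5-9: product code
```

Inside Pig, the module could then be registered with `REGISTER 'fixed_width.py' USING jython AS fw;` and applied with `B = FOREACH A GENERATE FLATTEN(fw.parse_fixed(line));`.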
Upvotes: 0
Reputation: 1666
Also, there is a built-in SUBSTRING function, which takes a start index and an exclusive stop index:
A = LOAD 'flat.txt' as (line:chararray);
B = FOREACH A GENERATE SUBSTRING(line,0,4),SUBSTRING(line,4,9),SUBSTRING(line,9,17);
dump B;
Upvotes: 1
Reputation: 10650
One way to split a line based on positions is to use REGEX_EXTRACT_ALL.
E.g.:
A = LOAD 'flat.txt' as (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,
'^(.{1,4})(.{1,5})(.*)$')) AS (year:int, prod_code:chararray, msrp:double);
dump B;
(1999,ABCDE,12234.00)
(2000,DCEFS,00020.00)
(2012,FFEWS,00005.55)
Upvotes: 4