jwmajors81

Reputation: 1430

Process Flat File with Pig

How do you process a fixed-width flat file with Pig? For example, if each line holds one record in which the first four positions are the year, the next five are a product code, and the last eight contain the MSRP, how would you query this data with Pig? I'm probably missing something simple, but everything I've found so far requires a delimiter when loading data with Pig.

Some sample data is provided below:

1999ABCDE12234.00
2000DCEFS00020.00
2012FFEWS00005.55

Thanks in advance.

Jeremy

Upvotes: 2

Views: 647

Answers (3)

Konstantin Kudryavtsev

Reputation: 592

Both previous answers are great. You can also implement your own UDF if the input string is complicated or conditional parsing is required.
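As a rough illustration of the UDF route, here is a minimal sketch of a Jython UDF for the sample records in the question. The file name flat_udf.py, the function name parse_record, and the choice to return null for malformed lines are all assumptions made for this example, not something from the question. The fallback definition of outputSchema is only there so the file can also be imported and tested outside Pig.

```python
# Hypothetical UDF file, e.g. flat_udf.py (file and function names are illustrative).

try:
    outputSchema  # expected to be provided by Pig when registered as a Jython UDF
except NameError:
    # Fallback no-op decorator so the module also runs outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("rec:tuple(year:int, prod_code:chararray, msrp:double)")
def parse_record(line):
    """Split a 17-character fixed-width record into (year, prod_code, msrp)."""
    if line is None or len(line) < 17:
        return None  # assumption: short or missing lines become null
    try:
        year = int(line[0:4])      # positions 1-4
        prod_code = line[4:9]      # positions 5-9
        msrp = float(line[9:17])   # positions 10-17
    except ValueError:
        return None  # assumption: non-numeric year or price becomes null
    return (year, prod_code, msrp)
```

Assuming the file is saved next to the Pig script, it could be used roughly like this:

REGISTER 'flat_udf.py' USING jython AS flat;
A = LOAD 'flat.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(flat.parse_record(line));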

Upvotes: 0

Adrian Seungjin Lee

Reputation: 1666

There is also a built-in SUBSTRING function. Its stop index is exclusive, so the three fields sit at [0,4), [4,9), and [9,17):

A = LOAD 'flat.txt' AS (line:chararray);
B = FOREACH A GENERATE SUBSTRING(line,0,4), SUBSTRING(line,4,9), SUBSTRING(line,9,17);
dump B;

Upvotes: 1

Lorand Bendig

Reputation: 10650

One way to split a line based on positions is to use REGEX_EXTRACT_ALL.

For example:

A = LOAD 'flat.txt' as (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 
      '^(.{1,4})(.{1,5})(.*)$')) AS (year:int, prod_code:chararray, msrp:double);
dump B;
(1999,ABCDE,12234.00)
(2000,DCEFS,00020.00)
(2012,FFEWS,00005.55)

Upvotes: 4
