Govind

Reputation: 449

Handle thorn delimiter in Pig

My source is a log file that uses "þ" as the delimiter. I am trying to read this file in Pig. Here are the options I tried.

Option 1 :

Using PigStorage("þ"): this doesn't work, as it can't handle Unicode characters.

Option 2 :

I tried reading each line as a string and splitting it on "þ". This also doesn't work, because STRSPLIT left out the last field, which has "\n" at the end.

I can see multiple questions about this on the web, but I was unable to find a solution. Kindly point me in the right direction.

Thorn Details : http://www.fileformat.info/info/unicode/char/fe/index.htm
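
For reference, here is a small Python sketch (not Pig) of how the thorn character looks at the byte level. It is only an illustration of the encoding issue: a loader that expects a single-byte delimiter may plausibly struggle when the file is UTF-8, because þ (U+00FE) is one byte in Latin-1 but two bytes in UTF-8. The variable names are made up for this example.

```python
# þ is U+00FE, LATIN SMALL LETTER THORN
THORN = "\u00fe"

# In Latin-1 (ISO-8859-1) the thorn is a single byte.
latin1_bytes = THORN.encode("latin-1")

# In UTF-8 the same character takes two bytes.
utf8_bytes = THORN.encode("utf-8")

# Reading those UTF-8 bytes as Latin-1 produces two characters instead of one.
misread = utf8_bytes.decode("latin-1")

print(latin1_bytes, utf8_bytes, misread)
```

If the delimiter shows up as two odd characters in a viewer or in script output, checking the file's actual encoding is probably the first thing to try.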

Upvotes: 0

Views: 816

Answers (2)

Sivasakthi Jayaraman

Reputation: 4724

Is this the solution you are expecting?

input.txt:  
helloþworldþhelloþworld  
helloþworldþhelloþworld  
helloþworldþhelloþworld  
helloþworldþhelloþworld  
helloþworldþhelloþworld  

PigScript:
A = LOAD 'input.txt' as line;  
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)'));  
dump B;  

Output:  
(hello,world,hello,world)  
(hello,world,hello,world)  
(hello,world,hello,world)  
(hello,world,hello,world)  
(hello,world,hello,world)  
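
To make the behaviour of that pattern easier to check outside Pig, here is a rough Python equivalent. It assumes the whole line must match the pattern and that a trailing newline is stripped first; `extract_fields` is a hypothetical helper name, not anything Pig provides.

```python
import re

# Same pattern as in the Pig script: four greedy groups separated by þ.
PATTERN = re.compile(r"(.*)þ(.*)þ(.*)þ(.*)")

def extract_fields(line):
    """Return the four captured fields, or None if the line does not match."""
    match = PATTERN.fullmatch(line.rstrip("\n"))
    return match.groups() if match else None

fields = extract_fields("helloþ1234þ1970-01-01T00:00:00.000+00:00þworld\n")
too_few = extract_fields("helloþworld")
too_many = extract_fields("aþbþcþdþe")

print(fields)
print(too_few)
print(too_many)
```

Note that the pattern fixes the number of fields: a line with fewer delimiters does not match at all, and with more delimiters the greedy first group absorbs the extra ones.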

Added 2nd option with different datatypes:

input.txt  
helloþ1234þ1970-01-01T00:00:00.000+00:00þworld  
helloþ4567þ1990-01-01T00:00:00.000+00:00þworld  
helloþ8901þ2001-01-01T00:00:00.000+00:00þworld  
helloþ9876þ2014-01-01T00:00:00.000+00:00þworld  

PigScript:  
A = LOAD 'input.txt' as line;  
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)')) as (f1:chararray,f2:long,f3:datetime,f4:chararray);  
DUMP B;  
DESCRIBE B;

Output:
(hello,1234,1970-01-01T00:00:00.000+00:00,world)  
(hello,4567,1990-01-01T00:00:00.000+00:00,world)  
(hello,8901,2001-01-01T00:00:00.000+00:00,world)  
(hello,9876,2014-01-01T00:00:00.000+00:00,world)  

B: {f1: chararray,f2: long,f3: datetime,f4: chararray}

Another thorn symbol A¾:

input.txt  
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0  
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0  
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0  

PigScript:  
A = LOAD 'input.txt' as line;  
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)A¾(.*)þ(.*)þ(.*)þ(.*)þ(.*)')) as (f1:long,f2:datetime,f3:datetime,f4:int,f5:double,f6:int);  
DUMP B;  
describe B;  

Output:  
(1077,04-01-2014,04-30-2014,0,0.0,0)  
(1077,04-01-2014,04-30-2014,0,0.0,0)  
(1077,04-01-2014,04-30-2014,0,0.0,0) 
B: {f1: long,f2: datetime,f3: datetime,f4: int,f5: double,f6: int}


Upvotes: 2

Frederic

Reputation: 3284

This should work (replace the Unicode code point with the one that works for you; this one is for capital thorn):

A = LOAD 'input' USING 
B = FOREACH A GENERATE STRSPLIT(f1, '\\u00DE', -1);

I don't see why the last field should be left out.
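
One situation where trailing fields can go missing is when they are empty. As far as I know, STRSPLIT follows Java's String.split semantics, where a limit of 0 drops trailing empty strings and a negative limit keeps them. The Python sketch below only imitates that behaviour for a literal separator; `split_fields` and its `keep_trailing` flag are hypothetical names for this illustration.

```python
def split_fields(line, sep="þ", keep_trailing=True):
    """Split on a literal separator.

    keep_trailing=False imitates a split limit of 0 (trailing empty
    strings are dropped); keep_trailing=True imitates a limit of -1.
    """
    if sep not in line:
        return [line]
    parts = line.split(sep)
    if not keep_trailing:
        # Discard empty strings at the end, one at a time.
        while parts and parts[-1] == "":
            parts.pop()
    return parts

row = "helloþworldþþ"
dropped = split_fields(row, keep_trailing=False)
kept = split_fields(row, keep_trailing=True)

print(dropped)
print(kept)
```

If the last columns of the log can be empty, passing -1 as the limit (as in the snippet above) is probably the safer choice.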

Somehow, this does not work:

A = LOAD 'input' USING PigStorage('\00DE');

Upvotes: 0
