Reputation: 449
My Source is a log file having "þ" as delimiter.I am trying to read this file in Pig.Please look at the options I tried.
Option 1 :
Using PigStorage("þ") - This does'nt work out as it cant handle unicode characters.
Option 2 :
I tried reading the lines as string and tried to split the line with "þ".This also does'nt work out as the STRSPLIT left out the last field as it has "\n" in the end.
I can see multiple questions in web, but unable to find a solution. Kindly direct me with this.
Thorn Details : http://www.fileformat.info/info/unicode/char/fe/index.htm
Upvotes: 0
Views: 816
Reputation: 4724
Is this the solution are you expecting?
input.txt:
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)'));
dump B;
Output:
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
Added 2nd option with different datatypes:
input.txt
helloþ1234þ1970-01-01T00:00:00.000+00:00þworld
helloþ4567þ1990-01-01T00:00:00.000+00:00þworld
helloþ8901þ2001-01-01T00:00:00.000+00:00þworld
helloþ9876þ2014-01-01T00:00:00.000+00:00þworld
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)')) as (f1:chararray,f2:long,f3:datetime,f4:chararray);
DUMP B;
DESCRIBE B;
Output:
(hello,1234,1970-01-01T00:00:00.000+00:00,world)
(hello,4567,1990-01-01T00:00:00.000+00:00,world)
(hello,8901,2001-01-01T00:00:00.000+00:00,world)
(hello,9876,2014-01-01T00:00:00.000+00:00,world)
B: {f1: chararray,f2: long,f3: datetime,f4: chararray}
Another thorn symbol A¾:
input.txt
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077A¾04-01-2014þ04-30-2014þ0þ0.0þ0
PigScript:
A = LOAD 'jinput.txt' as line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)A¾(.*)þ(.*)þ(.*)þ(.*)þ(.*)')) as (f1:long,f2:datetime,f3:datetime,f4:int,f5:double,f6:int);
DUMP B;
describe B;
Output:
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
B: {f1: long,f2: datetime,f3: datetime,f4: int,f5: double,f6: int}
}
Upvotes: 2
Reputation: 3284
This should work (replace the unicode code point with the one that's working for you, this is for capital thorn):
A = LOAD 'input' USING
B = FOREACH A GENERATE STRSPLIT(f1, '\\u00DE', -1);
I don't see why the last field should be left out.
Somehow, this does not work:
A = LOAD 'input' USING PigStorage('\00DE');
Upvotes: 0