Reputation: 47
I am trying to parse a log extract that uses multiple delimiters with Pig. Sample data is below:
CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5
msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64
My code is as below:
A = LOAD '/user/cef.csv' USING PigStorage(' ') as
(a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray,g:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2),STRSPLIT(d,'=',2),STRSPLIT(e,'=',2),STRSPLIT(f,'=',2),STRSPLIT(g,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2),FLATTEN($3),FLATTEN($4),FLATTEN($5);
D = FOREACH C GENERATE $2,flatten(STRSPLIT($4,'"',2)),flatten(STRSPLIT($5,'"',2)),$7,$9;
E = FOREACH D GENERATE (int)$0,(chararray)$2,(chararray)$3,(int)$5,(int)$6 as (a:int,b:chararray,c:chararray,D:int,E:int);
Now when I dump E, I get the error:
grunt> 2015-05-25 04:06:48,092 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1031: Incompatable schema: left is
"a:int,b:chararray,c:chararray,D:int,E:int", right is ":int"
I am trying to cast the output of my flatten and strsplit operations into chararray and int.
Please let me know whether this can be done
Thank you for the help!
Upvotes: 1
Views: 1055
Reputation: 3570
Your problem is how you use the as clause. Since you place the as after the last expression, Pig assumes you are trying to specify that schema only for that last expression. You are therefore assigning a schema of five fields to a single field, hence the error.
Do it like this:
E = FOREACH D GENERATE (int)$0 as a:int,(chararray)$2 as b,(chararray)$3 as c,(int)$5 as d,(int)$6 as e;
However, you are casting 09:41:38" to an int, so it will give you another error once you change it. You need to check again how you are splitting the data.
In my humble opinion, you should try to split the files by their delimiter before processing them in Pig, and then load them with their delimiter and perform a union. If your data is too large, then forget this idea... But your code is going to get too messy if you have several delimiters in the same file.
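As a rough illustration of that preprocessing idea, here is a minimal Python sketch that turns one record into a tab-delimited line before it ever reaches Pig. It assumes each record sits on a single line (the sample in the question looks wrapped) and that the extension keys are eventId, start_time, duration and policy_id. The names parse_cef and to_tsv are made up for this example, not part of any library.

```python
import re

# Assumption: one record per line; the question's two-line sample is joined here.
SAMPLE = (
    r'CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5 '
    r'msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64'
)

# key, optional backslash before '=', then a quoted or bare value.
# The lookahead (?=\s|$) makes the outer "msg=" wrapper fail to match,
# so only eventId and the escaped inner pairs are captured.
PAIR = re.compile(r'(\w+)\\?=("[^"]*"|[^\s=\\]+)(?=\s|$)')


def parse_cef(line):
    """Split a line into the 7 pipe-delimited header fields and a dict of key/value pairs."""
    parts = line.split('|', 7)          # 7 pipes -> 7 header fields + extension
    header, extension = parts[:7], parts[7]
    fields = {key: value.strip('"') for key, value in PAIR.findall(extension)}
    return header, fields


def to_tsv(line, keys=("eventId", "start_time", "duration", "policy_id")):
    """Emit the chosen extension fields as one tab-delimited row (hypothetical column order)."""
    _, fields = parse_cef(line)
    return "\t".join(fields.get(k, "") for k in keys)


if __name__ == "__main__":
    print(to_tsv(SAMPLE))
```

The resulting file has a single delimiter, so it could presumably be loaded with PigStorage('\t') and a typed schema such as (eventId:int, start_time:chararray, duration:int, policy_id:int), with no STRSPLIT or FLATTEN chains needed.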
Upvotes: 0