san71

Reputation: 47

Casting the output from Flatten and Strsplit in Pig

I am trying to parse a log extract that contains multiple delimiters using Pig. Sample data is below:

CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5                  
msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64

My code is as below:

A = LOAD '/user/cef.csv' USING PigStorage(' ') as  
(a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray,g:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2),STRSPLIT(d,'=',2),STRSPLIT(e,'=',2),STRSPLIT(f,'=',2),STRSPLIT(g,'=',2);
C = FOREACH B GENERATE FLATTEN($0),FLATTEN($1),FLATTEN($2),FLATTEN($3),FLATTEN($4),FLATTEN($5);
D = FOREACH C GENERATE $2,flatten(STRSPLIT($4,'"',2)),flatten(STRSPLIT($5,'"',2)),$7,$9;
E = FOREACH D GENERATE (int)$0,(chararray)$2,(chararray)$3,(int)$5,(int)$6 as (a:int,b:chararray,c:chararray,D:int,E:int);

Now when I dump E, I get this error:

grunt> 2015-05-25 04:06:48,092 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable schema: left is "a:int,b:chararray,c:chararray,D:int,E:int", right is ":int"

I am trying to cast the output of my FLATTEN and STRSPLIT operations to chararray and int.

Please let me know whether this can be done.

Thank you for the help!

Upvotes: 1

Views: 1055

Answers (1)

Balduz

Reputation: 3570

Your problem is how you use the AS clause. Because you place AS after the last (fifth) expression, Pig assumes you are specifying that schema for that expression only. You are therefore assigning a five-field schema to a single field, hence the error (note the right-hand side ":int" in the message).

Do it like this:

E = FOREACH D GENERATE (int)$0 as a:int,(chararray)$2 as b,(chararray)$3 as c,(int)$5 as d,(int)$6 as e;

However, you are casting 09:41:38" to an int, so you will run into another problem once you fix the schema. Check again how you are splitting the data.

In my humble opinion, you should split the files by their delimiters before processing them in Pig, then load each part with its own delimiter and perform a union. If your data is too large, forget this idea. But your Pig code is going to get messy if you have several delimiters in the same file.
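As a rough illustration of the pre-processing idea, here is a minimal Python sketch that turns one log line into a single tab-delimited record. It rests on assumptions taken only from the sample in the question: the key/value pairs sit after the last '|', msg= merely wraps further pairs, and those inner pairs escape '=' as '\='. The names parse_extension, to_tsv and COLUMNS are made up for this example.

```python
import re

# Assumed layout (from the sample line only): pairs look like key=value
# or key="quoted value" once the "msg=" wrapper and the "\=" escapes are removed.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')

# Hypothetical output column order, chosen for this example.
COLUMNS = ("eventId", "start_time", "duration", "policy_id")


def parse_extension(line):
    """Return the key/value pairs found after the last '|' of one log line."""
    extension = line.rsplit("|", 1)[-1]
    # Drop the msg= wrapper and un-escape the inner '=' signs.
    extension = extension.replace("msg=", "", 1).replace("\\=", "=")
    return {key: value.strip('"') for key, value in PAIR.findall(extension)}


def to_tsv(line):
    """Render the selected columns as one tab-separated record."""
    fields = parse_extension(line)
    return "\t".join(fields.get(name, "") for name in COLUMNS)
```

With output in this shape, the file could presumably be loaded with PigStorage('\t') and a typed schema in the LOAD statement, which would avoid the chained STRSPLIT and FLATTEN calls and the later casts.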

Upvotes: 0
