Reputation: 862
I am working with Pig 0.12.1
and Map-R
. I am trying to find max of a field after grouping the relation on some other field. Refer the following pig script and structure of relation in comments-
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2);
-- r1: {c1: biginteger,c2: biginteger}
r2 = group r1 by c1;
-- r2: {group: chararray,r1: {(c1: chararray,c2: biginteger)}}
DUMP r2;
/* output -
1234|{(1234,9876)}
2345|{(2345,8765)}
3456|{(3456,7654)}
4567|{(4567,6543)}
*/
r3 = foreach r2 generate group as c1, MAX(r1.c2) as c2;
I am getting the following error
Could not infer the matching function for org.apache.pig.builtin.MAX as multiple or none of them fit. Please use an explicit cast.
Script Explained-
I am flattening group of SomeRelation into c1, c2 and then regrouping on c1 to generate max of c2 with each c1 group.
Please suggest.
Upvotes: 0
Views: 636
Reputation: 1691
We now know the problem was the unability of MAX to handle biginteger.
You should be able to group and get the max like this, and compare results with combination of order + limit :
r1 = FOREACH SomeRelation GENERATE FLATTEN(group) AS (c1, c2);
r3 = FOREACH (group r1 by c1) {
-- you may want to apply a function on a single column
-- or compare sort + limit to MAX
list = ORDER $1 BY c2 DESC;
list_max = LIMIT list 1;
GENERATE group AS c1, MAX(r1.c2) AS c2, list_max;
}
Upvotes: 0
Reputation: 862
Well it looks like the problem is that Pig doesn't allow MAX(or for that matter aggregate functions like SUM etc) on biginteger. Had to use long as a datatype for this to work. Refer the following-
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2:long);
-- r1: {c1: biginteger,c2: long}
Strangely, there's no documentation highlighting this almost like datatypes biginteger and bigdecimal.
Upvotes: 1
Reputation: 438
I'm not sure if you can use the group keyword under the flatten. Also, have you considered tokenizing the group before flattening it. See this for example:
load_data = LOAD '/PIG_TESTS_ALL/WordCount' as (line);
tokenizing_data = FOREACH load_data generate flatten(TOKENIZE(line)) as word;
group_data = GROUP tokenizing_data by word;
Result = FOREACH group_data generate group,COUNT(tokenizing_data);
dump Result;
This is actually for word count, You can probably build on this to find max value based on what you want to do.
Upvotes: 1