Greatest-n-per-group within a second group

Question

I've got a database where each entry is an edge with a source tag, a relationship and a weight. I want to perform a query where given a source tag, I get the top n edges by weight with that source tag per relationship.

For example, given the entries

Id   Source   Relationship   End      Weight
-----------------------------------------
1    cat       isA           feline   56
2    cat       isA           animal   12
3    cat       isA           pet      37
4    cat       desires       food     5
5    cat       desires       play     88
6    dog       isA           canine   72

If I queried using "cat" as a source and n=2, the result should be

Id   Source   Relationship   End      Weight
-----------------------------------------
1    cat       isA           feline   56
3    cat       isA           pet      37
4    cat       desires       food     5
5    cat       desires       play     88

I've tried several different approaches based on other questions.

The most sucessful so far is based on How to SELECT the newest four items per category?

SELECT *
FROM tablename t1
JOIN tablename t2 ON (t1.relationship = t2.relationship)
LEFT OUTER JOIN tablename t3
  ON (t1.relationship = t3.relationship AND t2.weight < t3.weight)
WHERE t1.source = "cat"
  AND t3.relationship IS NULL
ORDER BY t2.weight DESC;

However, this returns all the edges with source="cat" in sorted order. If I try to add LIMIT, I get the edges with the top weights not by group.

The other thing that I have tried is

SELECT *
FROM tablename t1
WHERE t1.source="cat"
AND (
     SELECT COUNT(*) 
     FROM tablename t2
     WHERE t1.relationship = t2.relationship 
     AND t1.weight <= t2.weight           
) <= 2;

This returns

Id   Source   Relationship   End      Weight
-----------------------------------------
1    cat       isA           feline   56
4    cat       desires       food     5
5    cat       desires       play     88

Because edge 6 has a higher weight for the isA relationship than edge 2, but is excluded from the results because the source="dog"

I am very new to databases, so if I need to take a completely different approach, let me know. I'm not afraid of starting over.

Bill Karwin · Accepted Answer

Doing this with the correlated subquery is indeed inefficient, because MySQL has to run the subquery for every row of the outer query, just to decide if the row in the outer query meets the conditions. That's a lot of overhead.

Here's a method using no subquery:

SELECT t1.*
FROM tablename t1
JOIN tablename t2 ON t1.source = t2.source and t1.relationship = t2.relationship
  AND t1.weight <= t2.weight
WHERE t1.source = 'cat' 
GROUP BY t1.id
HAVING COUNT(*) <= 2;

And here's a method using neither subquery, nor joins/group by:

SELECT *
FROM (
    SELECT tablename.*, IF(@r = relationship, @n:=@n+1, @n:=1) AS _n, 
        @r:=relationship AS _r
    FROM (SELECT @r:=null, @n:=1) _init, tablename
    WHERE source = 'cat'
    ORDER BY relationship, weight DESC
) AS _t
WHERE _n <= 2;

These solutions also need some tiebreaker in case there are multiple rows with the same top weights. But that applies to all the solutions.

The simpler solution, which wouldn't require special gymnastics or tiebreakers, is to use SQL window functions like ROW_NUMBER() OVER (PARTITION BY relationship), but MySQL does not support these.

Greatest-n-per-group within a second group

Answers (2)

Related Questions