Reputation:
My table:
CREATE TABLE `beer`.`matches` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`hashId` int(10) unsigned NOT NULL,
`ruleId` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
If a hash has matched a rule, there's an entry in this table.
1) Count how many hashIds there are for each unique ruleId (AKA "how many hashes matched each rule")
SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*)
2) Select the 10 best rules (ruleIds), that is, select the 10 rules that combined match the greatest number of unique hashes. This means that a rule that matches a lot of hashes is not necessarily a good rule if another rule covers all the same hashes. Basically I want to select the 10 ruleIds that catch the most unique hashIds.
EDIT: Basically I have a sub-optimal solution in PHP/SQL here, but depending on the data it doesn't necessarily give me the best answer to question 2). I'd be interested in a better solution. Read the comments for more information.
Upvotes: 12
Views: 842
Reputation: 56735
The approach that I think will work best for this is based on the same logic/method that the statistical technique of multivariate co-factor analysis uses.
That is, instead of trying to solve the inherently combinatorial problem of "What combination of 10 factors (or 'rules', for your problem) out of the existing rules best fulfills some criterion?", it incrementally answers a much easier question: "Given what I already have, what additional factor ('rule') best improves how well the criterion is fulfilled?"
Procedurally, it goes like this: first, find the rule that has the most (distinct) hashes matching it. Don't worry about overlap with other rules, just find the single best one. Add it to a list (or table) of already-selected rules.
Now, find the next-best rule, given the rule(s) that you already have. In other words, find the rule that matches the most hashes, excluding any hashes already matched by the already-selected rule(s). Add this new rule to your already-selected list of rules, and repeat until you have 10 rules.
So this approach basically avoids the inherently combinatorial problem of trying to find the absolute, globally best solution, by finding the incrementally relative/local best solution. Some points in this approach:
It is O(n*k), where 'k' is the number of rules you want to find. Combinatorial approaches tend to be non-polynomial, like O(2^n) or O(n!), which is highly undesirable performance-wise.
It is possible that this approach will not give the absolute *best* 10 rules for your criterion. However, in my experience, it tends to do very well in the real-world cases of problems like this one. It is usually at most one or two rules off the absolute 10 best.
The SQL code for the incremental search is very easy (you already have most of it). But the SQL code to actually do it N=10 times is inherently procedural and thus requires the less standard, more idiosyncratic parts of SQL (translation: I know how to do it in T-SQL, but not in MySQL).
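For illustration, a rough T-SQL sketch of that loop could look like the following (untested here; the @selected table variable and loop counter are made up for this sketch, only the matches table comes from the question):
-- greedy selection: each pass picks the rule adding the most uncovered hashes
DECLARE @selected TABLE (ruleId int NOT NULL);
DECLARE @i int = 0;
WHILE @i < 10
BEGIN
    INSERT INTO @selected (ruleId)
    SELECT TOP (1) m.ruleId
    FROM matches m
    WHERE m.ruleId NOT IN (SELECT ruleId FROM @selected)        -- skip rules already chosen
      AND m.hashId NOT IN (SELECT m2.hashId                     -- skip hashes already covered
                           FROM matches m2
                           JOIN @selected s ON s.ruleId = m2.ruleId)
    GROUP BY m.ruleId
    ORDER BY COUNT(DISTINCT m.hashId) DESC;
    SET @i = @i + 1;
END
SELECT ruleId FROM @selected;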
Upvotes: 2
Reputation: 22885
Although I come from the PostgreSQL world, I found this question really interesting and took my time to look into it.
I split the whole process into 2 subroutines: count_matches, which, for each ruleId appended to an input rule array, counts the distinct hashIds matched by the resulting array, and max_matches, which recursively builds the rule combinations on top of it.
As a result, the second function will return all combinations that give you the maximal number of unique hashIds for a given ruleId count.
Here's the code that will create a test setup (tested on PostgreSQL 9.1). As the original question is for MySQL, I will comment on what is going on there:
create table matches (
id int4 not null,
hashId int4 not null,
ruleId int4 not null,
primary key (id)
);
insert into matches
SELECT generate_series(1,200), (random()*59+1)::int4, (random()*19+1)::int4;
-- This query will generate a 200-row table, with:
-- - first column having values in 1-200 range (id)
-- - second column will have random numbers in 1-60 range (hashId)
-- - third column will have random numbers in 1-20 range (ruleId)
Function for phase 1 (quite simple):
CREATE OR REPLACE FUNCTION count_matches(i_array int4[],
OUT arr int4[], OUT cnt int4) RETURNS SETOF record
AS $$
DECLARE
rec_o record;
rec_i record;
BEGIN
-- in the outer loop, we're going over all the combinations of input array
-- with the ruleId appended
FOR rec_o IN SELECT DISTINCT i_array||ruleId AS rules
FROM matches ORDER BY 1
LOOP
-- in the inner loop we're counting the distinct hashId combinations
-- for the outer loop provided array
-- and returning the new array + count
FOR rec_i IN SELECT count(distinct hashId) AS cnt
FROM matches WHERE ruleId = ANY(rec_o.rules)
LOOP
arr := rec_o.rules;
cnt := rec_i.cnt;
RETURN NEXT ;
END LOOP;
END LOOP;
RETURN ;
END;
$$ LANGUAGE plpgsql;
If you give an empty array as input to this function, you will get the same results as from case #1 of the initial question:
SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*);
-- both queries yields same results
SELECT cnt, arr FROM count_matches(ARRAY[]::int4[]);
Now the main working function:
-- function receives 3 parameters, 2 of them have default values
-- which makes it possible to query: max_matches(10)
-- to obtain results from the initial question
CREATE OR REPLACE FUNCTION max_matches(maxi int4,
arri int4[] DEFAULT array[]::int4[],
curi int4 DEFAULT 1, OUT arr int4[]) RETURNS SETOF int4[]
AS $$
DECLARE
maxcnt int4;
a int4[];
b int4[];
BEGIN
-- Fall out early for "easy" cases
IF maxi < 2 THEN
RAISE EXCEPTION 'Too easy, do a GROUP BY query instead';
END IF;
a = array[]::int4[];
-- first, we find out what is the maximal possible number of hashIds
-- on a given level
SELECT max(cnt) INTO maxcnt FROM count_matches(arri);
-- then we check each combination that yields the found number
-- of unique hashIds
FOR arr IN SELECT cm.arr FROM count_matches(arri) cm
WHERE cm.cnt = maxcnt
LOOP
-- if we're on the deepest level of recursion,
-- we just return back the found combination
IF curi = maxi THEN
RETURN NEXT ;
ELSE
-- otherwise we ask further down
FOR b IN SELECT * FROM max_matches(maxi, arr, curi+1) LOOP
-- this loop and IF clause are required to eliminate
-- equal arrays, so that if we get {6,14} and {14,6} returned
-- we will use only one of the two, as they're the same
IF NOT a @> b THEN
a = array_cat(a, b);
RETURN QUERY SELECT b;
END IF;
END LOOP;
END IF;
END LOOP;
RETURN ;
END;
$$ LANGUAGE plpgsql;
Unfortunately this approach is time consuming. For my test setup I get the following performance, and spending 8 seconds on a 200-row "big" table seems like overkill.
select * from max_matches(10);
arr
-----------------------------
{6,14,4,16,8,1,7,10,11,18}
{6,14,4,16,8,1,7,11,12,18}
{6,14,4,16,8,7,10,11,15,18}
{6,14,4,16,11,10,1,7,18,20}
(4 rows)
Time: 8034,700 ms
I hope you don't mind me jumping into this question. I also hope you will find my answer at least partially useful for your purposes :)
And thanks for the question, I had a very good time trying to solve it!
Upvotes: 2
Reputation: 899
Here's a solution which might be good enough. Indexes and/or manually created caching tables might help performance, although on a sparsely populated table it works instantly.
The idea is brutally simple: create a view to explicitly show all possibilities, then combine all of them and find the best by ordering. Same-rule combinations are allowed, as a certain rule by itself might be more efficient than a combination of others.
Based on a table similar to the one you described above, with columns named "id", "hash_id" and "rule_id", create a helper view (this way it's easier to test/debug) using the following select:
SELECT `t1`.`hash_id` AS `h1`,`t2`.`hash_id` AS `h2`,`t3`.`hash_id` AS `h3`,`t1`.`rule_id` AS `r1`,`t2`.`rule_id` AS `r2`,`t3`.`rule_id` AS `r3` from (`hashTable` `t1` join `hashTable` `t2` join `hashTable` `t3`)
The above view is set to create a triple-join table. You can add t4.hash_id as h4, t4.rule_id as r4 to the SELECT and join hashTable t4 to the FROM to add a fourth join, and so forth up to 10.
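For reference, a possible way to wrap that select into the hashView used by the queries below (the view name is taken from those queries; the wrapper itself is my assumption):
-- materialize the triple cross-join as a view for easier querying
CREATE VIEW hashView AS
SELECT `t1`.`hash_id` AS `h1`, `t2`.`hash_id` AS `h2`, `t3`.`hash_id` AS `h3`,
       `t1`.`rule_id` AS `r1`, `t2`.`rule_id` AS `r2`, `t3`.`rule_id` AS `r3`
FROM `hashTable` `t1`
JOIN `hashTable` `t2`
JOIN `hashTable` `t3`;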
After creating the view, the following query gives the 2-rule combinations with their hash coverage explicitly shown:
select group_concat(distinct h1),concat(r1, r2) from (select distinct h1,r1,r2 from hashView union distinct select distinct h2,r1,r2 from hashView) as uu group by concat(r1,r2)
If you don't need to see the hash coverage, this one may be better:
select count(distinct h1) as cc,concat(r1, r2) from (select distinct h1,r1,r2 from hashView union distinct select distinct h2,r1,r2 from hashView) as uu group by concat(r1,r2) order by cc
Adding a 3rd rule match is simply a matter of adding h3 and r3 to the union and grouping on it:
select count(distinct h1),concat(r1, r2, r3) from (select distinct h1,r1,r2,r3 from hashView union distinct select distinct h2,r1,r2,r3 from hashView union distinct select distinct h3,r1,r2,r3 from hashView) as uu group by concat(r1,r2,r3)
If you do not need the option to choose how many top rules to match, you can do the concat() in the View itself and save some time on the union queries.
A possible performance increase is eliminating permuted rule id's.
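A sketch of one way to do that, assuming the same three-way view: force the rule ids into ascending order in the join, which drops permuted duplicates while still allowing same-rule combinations:
-- keep rule ids in ascending order to avoid counting permutations twice
SELECT `t1`.`hash_id` AS `h1`, `t2`.`hash_id` AS `h2`, `t3`.`hash_id` AS `h3`,
       `t1`.`rule_id` AS `r1`, `t2`.`rule_id` AS `r2`, `t3`.`rule_id` AS `r3`
FROM `hashTable` `t1`
JOIN `hashTable` `t2` ON `t2`.`rule_id` >= `t1`.`rule_id`
JOIN `hashTable` `t3` ON `t3`.`rule_id` >= `t2`.`rule_id`;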
All of the above was only tested using single-digit rule ids, so instead of concat() you should probably use concat_ws(), like this, for a pre-concatenated view:
select `t1`.`hash_id` AS `h1`,`t2`.`hash_id` AS `h2`,`t3`.`hash_id` AS `h3`,concat_ws(",",`t1`.`rule_id`,`t2`.`rule_id`,`t3`.`rule_id`) AS `r` from (`hashTable` `t1` join `hashTable` `t2` join `hashTable` `t3`)
And then the union query:
select count(distinct h1) as cc,r from (select distinct h1,r from hashView union distinct select distinct h2,r from hashView union distinct select distinct h3,r from hashView) as uu group by r order by cc
Let me know if this solves the problem at hand or if there are additional constraints that weren't disclosed before.
Depending on the number of rules and hashes, you can also always reverse the rule<->hash relation and instead create hash-based views.
The best idea is probably to combine this approach with real-life heuristics.
Upvotes: 1
Reputation: 39015
If you really want to find the best (optimal) solution, the problem is that you have to check all the possible combinations of 10 ruleIds and find how many hashIds are returned by each possible combination. The problem is that the number of combinations is roughly (the number of distinct ruleIds)^10 (in fact the number is smaller, if you consider that you cannot repeat the same ruleId within a combination: it's a combination of m elements taken in groups of 10).
NOTE: To be exact, the number of possible combinations is
m!/(n!(m-n)!) = m!/(10!(m-10)!), where ! is factorial: m! = m * (m-1) * (m-2) * ... * 3 * 2 * 1. For example, with m = 20 distinct ruleIds that is already 20!/(10! * 10!) = 184,756 combinations.
To build these combinations you have to join your table with itself 10 times, excluding the already-used ruleIds, somewhat like this:
select distinct m1.ruleid r1, m2.ruleid r2, m3.ruleid r3 ...
from matches m1 inner join matches m2 on m2.ruleid <> m1.ruleid
inner join matches m3 on m3.ruleid <> m1.ruleid and m3.ruleid <> m2.ruleid
...
Then you have to find the highest count of
select r1, r2, r3..., count(distinct hashid)
from ("the combinations of 10 ruleIds defined above") G10
inner join matches m
on m.ruleid = r1 or m.ruleid = r2 or m.ruleid = r3...
group by r1, r2, r3...
This gigantic query would take a lot of time to run.
There can be much faster procedures that will give you sub-optimal results.
SOME OPTIMIZATION:
This could be somewhat optimized, depending on the shape of the data, by looking for groups which are equal to or included in other groups. This would require fewer than (m*(m+1))/2 comparisons, which compared to the other number is a big deal, especially if it's quite probable to find several groups that can be discarded, which will lower m. Even so, the main query still has a gigantic cost.
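As a sketch of that pre-filtering idea (this query is illustrative, not from the original answer): a rule whose hashes are all covered by another single rule can be discarded up front. Note that two rules with identical hash sets will flag each other, so keep one of each such pair:
-- find ruleIds whose hashes are all covered by some other single rule
SELECT a.ruleId AS redundantRule, b.ruleId AS coveringRule
FROM (SELECT DISTINCT ruleId FROM matches) a
JOIN (SELECT DISTINCT ruleId FROM matches) b ON b.ruleId <> a.ruleId
WHERE NOT EXISTS (
        SELECT 1
        FROM matches ma
        WHERE ma.ruleId = a.ruleId
          AND NOT EXISTS (SELECT 1
                          FROM matches mb
                          WHERE mb.ruleId = b.ruleId
                            AND mb.hashId = ma.hashId));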
Upvotes: 3
Reputation: 37388
I think your problem is a variation of the "knapsack problem".
I think you already understand that you can't just take whatever ruleIds match the most hashIds like the other answers are suggesting, because while each of those ruleIds might match, say, 100 hashIds, they might all match the same 100 hashIds... but if you had selected 10 other ruleIds which only matched 25 hashIds each, but with each of the hashIds matched by each ruleId being unique, you'd end up with more unique hashIds with that set.
To solve this, you could start by selecting whatever ruleId matches the most hashIds, and then next selecting whatever ruleId matches the most hashIds that aren't included in the hashIds matched by the previous ruleIds... continuing this process until you've selected 10 ruleIds.
There could still be anomalies in your data distribution that would cause this to not produce an optimal set of ruleIds... so if you wanted to go crazy, you could consider implementing a genetic algorithm to try to improve the "fitness" of your set of 10 ruleIds.
This isn't a task that SQL is particularly well suited to handle, but here's an example of the knapsack problem being solved with a genetic algorithm written in SQL(!)
EDIT
Here's an untested implementation of the solution where ruleIds are selected 1 at a time, with each iteration selecting whatever ruleId has the most unique hashIds that weren't previously covered by any other selected ruleIds:
--------------------------------------------------------------------------
-- Create Test Data
--------------------------------------------------------------------------
create table matches (
id int(10) unsigned not null auto_increment,
hashId int(10) unsigned not null,
ruleId int(10) unsigned not null,
primary key (id)
);
insert into matches (hashid, ruleid)
values
(1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1),
(1,2), (2,2), (3,2), (4,2), (5,2), (6,2), (7,2), (8,2), (9,2), (10,2),
(1,3), (2,3), (3,3), (4,3), (5,3), (6,3), (7,3), (8,3), (9,3), (10,3),
(1,4), (2,4), (3,4), (4,4), (5,4), (6,4), (7,4), (8,4), (9,4), (10,4),
(1,5), (2,5), (3,5), (4,5), (5,5), (6,5), (7,5), (8,5), (9,5), (10,5),
(1,6), (2,6), (3,6), (4,6), (5,6), (6,6), (7,6), (8,6), (9,6), (10,6),
(1,7), (2,7), (3,7), (4,7), (5,7), (6,7), (7,7), (8,7), (9,7), (10,7),
(1,8), (2,8), (3,8), (4,8), (5,8), (6,8), (7,8), (8,8), (9,8), (10,8),
(1,9), (2,9), (3,9), (4,9), (5,9), (6,9), (7,9), (8,9), (9,9), (10,9),
(11,10), (12,10), (13,10), (14,10), (15,10),
(11,11), (12,11), (13,11), (14,11), (15,11),
(16,12), (17,12), (18,12), (19,12), (20,12),
(21,13), (22,13), (23,13), (24,13), (25,13),
(26,14), (27,14), (28,14), (29,14), (30,14),
(31,15), (32,15), (33,15), (34,15), (35,15),
(36,16), (37,16), (38,16), (39,16), (40,16),
(41,17), (42,17), (43,17), (44,17), (45,17),
(46,18), (47,18), (48,18), (49,18), (50,18),
(51,19), (52,19), (53,19), (54,19), (55,19),
(56,20), (57,20), (58,20), (59,20), (60,20);
--------------------------------------------------------------------------
-- End Create Test Data
--------------------------------------------------------------------------
create table selectedRules (
ruleId int(10) unsigned not null
);
-- In MySQL a WHILE loop can only run inside a stored program,
-- so the selection loop is wrapped in a procedure here.
delimiter //
create procedure selectBestRules()
begin
  set @rulesSelected = 0;
  while (@rulesSelected < 10) do
    -- pick the rule matching the most hashIds not yet covered
    -- by the already-selected rules
    insert into selectedRules (ruleId)
    select m.ruleId
    from
      matches m left join (
        select distinct m2.hashId
        from
          selectedRules sr join
          matches m2 on m2.ruleId = sr.ruleId
      ) prev on prev.hashId = m.hashId
    where prev.hashId is null
    group by m.ruleId
    order by count(distinct m.hashId) desc
    limit 1;
    set @rulesSelected = @rulesSelected + 1;
  end while;
end //
delimiter ;
call selectBestRules();
select ruleId from selectedRules;
Upvotes: 11