Reputation: 1570

PostgreSQL 9.3 - Compare two sets of data without duplicating values in first set

I have a group of tables that define some rules that need to be followed, for example:

CREATE TABLE foo.subrules (
    subruleid SERIAL PRIMARY KEY,
    ruleid INTEGER REFERENCES foo.rules(ruleid),
    subrule INTEGER,
    barid INTEGER REFERENCES foo.bars(barid)
);

INSERT INTO foo.subrules(ruleid,subrule,barid) VALUES 
    (1,1,1),
    (1,1,2),
    (1,2,2),
    (1,2,3),
    (1,2,4),
    (1,3,3),
    (1,3,4),
    (1,3,5),
    (1,3,6),
    (1,3,7);

This defines a set of "subrules" that must be satisfied; if all "subrules" are satisfied, the rule is satisfied as well. In the above example, "subrule" 1 can be satisfied with a "barid" value of 1 or 2. Additionally, "subrule" 2 can be satisfied with a "barid" value of 2, 3, or 4. Likewise, "subrule" 3 can be satisfied with a "barid" value of 3, 4, 5, 6, or 7.

I also have a data set that looks like this:

 primarykey |  resource  |   barid  
------------|------------|------------
     1      |     A      |     1      
     2      |     B      |     2      
     3      |     C      |     8

The tricky part is that once a "subrule" is satisfied with a "resource", that "resource" can't satisfy any other "subrule" (even if the same "barid" would satisfy the other "subrule").

So, what I need is to evaluate and return the following results:

   ruleid   |   subrule  |   barid    | primarykey |  resource  
------------|------------|------------|------------|------------
     1      |     1      |     1      |     1      |     A      
     1      |     1      |     2      |    NULL    |    NULL
     1      |     2      |     2      |     2      |     B      
     1      |     2      |     3      |    NULL    |    NULL
     1      |     2      |     4      |    NULL    |    NULL
     1      |     3      |     3      |    NULL    |    NULL    
     1      |     3      |     4      |    NULL    |    NULL
     1      |     3      |     5      |    NULL    |    NULL
     1      |     3      |     6      |    NULL    |    NULL
     1      |     3      |     7      |    NULL    |    NULL
    NULL    |    NULL    |    NULL    |     3      |     C

Interestingly, if "primarykey" 3 had a "barid" value of 2 (instead of 8) the results would be identical.

I have tried several methods, including a plpgsql function that groups by "subrule" with ARRAY_AGG(barid), builds an array from the "barid" values of the data set, and loops to check whether each element of that array is in a "subrule" group, but it just doesn't feel right.

Is a more elegant or efficient option available?

Upvotes: 1

Answers (2)

Erwin Brandstetter

Reputation: 657787

Since you have not clarified the question, I am going with my own assumptions:

- subrule numbers are ascending without gaps for each rule.
- (subrule, barid) is UNIQUE in table subrules.
- If there are multiple resources for the same barid, assignments are arbitrary among these peers.
- As commented, the number of resources matches the number of subrules (which has no effect on my suggested solution).

The algorithm is as follows:
1. Pick the subrule with the smallest subrule number.
2. Assign a resource to the lowest barid possible (the first that has a matching resource), which consumes the resource.
3. After the first resource is matched, skip to the next higher subrule and repeat 2.
4. Append all remaining resources after the last subrule.
You can implement this with pure SQL using a recursive CTE:

WITH RECURSIVE cte AS ((
   SELECT s.*, r.resourceid, r.resource
        , CASE WHEN r.resourceid IS NULL THEN '{}'::int[]
               ELSE ARRAY[r.resourceid] END AS consumed
   FROM   subrules s
   LEFT   JOIN resource r USING (barid)
   WHERE  s.ruleid = 1
   ORDER  BY s.subrule, r.barid, s.barid
   LIMIT  1
   )
   UNION ALL (
   SELECT s.*, r.resourceid, r.resource
        , CASE WHEN r.resourceid IS NULL THEN c.consumed
                                         ELSE c.consumed || r.resourceid END
   FROM   cte           c
   JOIN   subrules      s ON s.subrule = c.subrule + 1
                         AND s.ruleid  = c.ruleid  -- stay within the same rule
   LEFT   JOIN resource r ON r.barid = s.barid
                         AND r.resourceid <> ALL (c.consumed)
   ORDER  BY r.barid, s.barid
   LIMIT  1
   ))
SELECT ruleid, subrule, barid, resourceid, resource FROM cte

UNION ALL  -- add unused rules
SELECT s.ruleid, s.subrule, s.barid, NULL, NULL 
FROM   subrules s
LEFT   JOIN cte c USING (subruleid)
WHERE  c.subruleid IS NULL
AND    s.ruleid = 1  -- same rule as in the CTE

UNION ALL  -- add unused resources
SELECT NULL, NULL, r.barid, r.resourceid, r.resource
FROM   resource r
LEFT   JOIN cte c USING (resourceid)
WHERE  c.resourceid IS NULL    
ORDER  BY subrule, barid, resourceid;

Returns exactly the result you have been asking for.
SQL Fiddle.

Explain

It's basically an implementation of the algorithm laid out above.

- Only take a single match on a single barid per subrule. Hence the LIMIT 1, which requires additional parentheses:
  - Sum results of a few queries and then find top 5 in SQL
- Collect "consumed" resources in the array consumed and exclude them from repeated assignment with r.resourceid <> ALL (c.consumed). Note in particular how I avoid NULL values in the array, which would break the test.
- The CTE only returns matched rows. Add rules and resources without match in the outer SELECT to get the complete result.
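
To see why a NULL in the consumed array would break the r.resourceid <> ALL (c.consumed) test, here is a rough Python emulation of SQL's three-valued logic, with None standing in for NULL. The helper names sql_ne and sql_ne_all are made up for this illustration; this is a sketch of the semantics, not PostgreSQL code.

```python
def sql_ne(a, b):
    """Emulate SQL `a <> b`: the result is unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b


def sql_ne_all(value, array):
    """Emulate SQL `value <> ALL (array)` under three-valued logic."""
    results = [sql_ne(value, item) for item in array]
    if any(r is False for r in results):
        return False   # the value equals at least one element
    if any(r is None for r in results):
        return None    # unknown: a join or WHERE condition treats this as not true
    return True        # also the outcome for an empty array


print(sql_ne_all(2, [1]))        # resource 2 is still free
print(sql_ne_all(2, [1, None]))  # a NULL element makes the test unknown
print(sql_ne_all(2, []))         # nothing consumed yet
```

A single NULL element turns every otherwise true comparison into unknown, so no resource would ever qualify in the join condition. That is why the CASE expressions in the query only append real resourceid values.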

Alternatively, open two cursors on the tables subrules and resource and implement the algorithm in any decent programming language (including PL/pgSQL).

Upvotes: 1

wildplasser

Reputation: 44250

The following fragment finds solutions, if there are any. The number three (resources) is hardcoded. If only one solution is needed some symmetry-breaker should be added.

If the number of resources is not bounded, I think there could be a solution by enumerating all possible tableaux (Hilbert? mixed-radix?) and selecting from them after pruning the non-satisfying ones.
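
A rough sketch of that enumerate-and-prune idea, written in Python rather than SQL. It assumes that a solution means one distinct resource per subrule, and it uses plain dictionaries in place of the tables; the function name full_assignments is hypothetical. The number of candidates grows factorially, so this is only practical for small inputs.

```python
from itertools import permutations


def full_assignments(allowed, resources):
    """Yield tuples of resource names, one per subrule in ascending subrule order.

    allowed:   dict mapping subrule number -> set of acceptable barids
    resources: dict mapping resource name -> barid
    """
    order = sorted(allowed)
    # Enumerate every ordered choice of distinct resources, one per subrule,
    # and keep only the choices where each barid is acceptable.
    for combo in permutations(resources, len(order)):
        if all(resources[name] in allowed[sub] for sub, name in zip(order, combo)):
            yield combo


ALLOWED = {1: {1, 2}, 2: {2, 3, 4}, 3: {3, 4, 5, 6, 7}}

without_d = list(full_assignments(ALLOWED, {"A": 1, "B": 2, "C": 8}))
with_d = list(full_assignments(ALLOWED, {"A": 1, "B": 2, "C": 8, "D": 7}))
print(without_d)
print(with_d)
```

Because each subrule gets exactly one resource here, the sketch is stricter than the SQL fragment below, which only requires the three subruleid values to differ.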

 -- the data
CREATE TABLE subrules
    ( subruleid SERIAL PRIMARY KEY
    , ruleid INTEGER -- REFERENCES foo.rules(ruleid),
    , subrule INTEGER
    , barid INTEGER -- REFERENCES foo.bars(barid)
);

INSERT INTO subrules(ruleid,subrule,barid) VALUES
    (1,1,1), (1,1,2),
    (1,2,2), (1,2,3), (1,2,4),
    (1,3,3), (1,3,4), (1,3,5), (1,3,6), (1,3,7);

CREATE TABLE resources
    ( primarykey INTEGER NOT NULL PRIMARY KEY
    ,  resrc  varchar
    ,  barid  INTEGER NOT NULL
        );

INSERT INTO resources(primarykey,resrc,barid) VALUES
      (1, 'A', 1) ,(2, 'B', 2) ,(3, 'C', 8)
        -- ################################
        -- uncomment next line to find a (two!) solution(s)
     -- ,(4, 'D', 7)
        ;

-- all matching pairs of subrules <--> resources
WITH pairs AS (
        SELECT sr.subruleid, sr.ruleid, sr.subrule, sr.barid
        , re.primarykey, re.resrc
        FROM subrules sr
        JOIN resources re ON re.barid = sr.barid
        )
SELECT
        p1.ruleid AS ru1 , p1.subrule AS sr1 , p1.resrc AS one
        , p2.ruleid AS ru2 , p2.subrule AS sr2 , p2.resrc AS two
        , p3.ruleid AS ru3 , p3.subrule AS sr3 , p3.resrc AS three
  -- self-join the pairs, excluding the ones that
  -- use the same subrule or resource
FROM pairs p1
JOIN pairs p2 ON p2.primarykey > p1.primarykey -- tie-breaker
JOIN pairs p3 ON p3.primarykey > p2.primarykey -- tie breaker
WHERE 1=1
AND p2.subruleid <> p1.subruleid
AND p2.subruleid <> p3.subruleid
AND p3.subruleid <> p1.subruleid
        ;

Result (after uncommenting the line with the missing resource):

 ru1 | sr1 | one | ru2 | sr2 | two | ru3 | sr3 | three 
-----+-----+-----+-----+-----+-----+-----+-----+-------
   1 |   1 | A   |   1 |   1 | B   |   1 |   3 | D
   1 |   1 | A   |   1 |   2 | B   |   1 |   3 | D
(2 rows)

The resources {A,B,C} could of course be hard-coded, but that would prevent the 'D' record (or any other) from serving as the missing link.

Upvotes: 2
