Tim
Tim

Reputation: 75

Smallest ID from an Undirected graph

Working with Oracle 11g, I'm trying to find the smallest ID in a undirected graph of ID pairs.

Using the following pairs:

create table unique_pairs ( ID1 INT, ID2 INT );
insert all 
    into unique_pairs (ID1,ID2) VALUES ( 1, 2 )  
    into unique_pairs (ID1,ID2) VALUES ( 1, 3 )
    into unique_pairs (ID1,ID2) VALUES ( 4, 2 )
    into unique_pairs (ID1,ID2) VALUES ( 4, 5 )
    into unique_pairs (ID1,ID2) VALUES ( 7, 6 )
    into unique_pairs (ID1,ID2) VALUES ( 8, 10 )
    into unique_pairs (ID1,ID2) VALUES ( 10, 7 )
select * from dual;
commit;

I'm using a union of two connect by queries to try and travel in two directions of the map:

select distinct a.ID1, a.ID2, min(a.ultimate_parent_id ) as ultimate_parent_id
from
(SELECT distinct ID1, ID2,
    connect_by_root(ID2) ultimate_parent_id
  FROM unique_pairs
CONNECT BY ID2 = prior ID1 AND ID2 != PRIOR ID2 
union
SELECT distinct ID1, ID2,
    connect_by_root(ID1) ultimate_parent_id
  FROM unique_pairs
CONNECT BY ID1 = prior ID2 AND ID1 != PRIOR ID1) a
group by a.ID1, a.ID2
order by 3;

I get this:

   ID1        ID2 ULTIMATE_PARENT_ID

     1          2                  1
     1          3                  1
     4          2                  2
     4          5                  4
    10          7                  6
     7          6                  6
     8         10                  6

and should be this:

   ID1        ID2 ULTIMATE_PARENT_ID

     1          2                  1
     1          3                  1
     4          2                  1
     4          5                  1
    10          7                  6
     7          6                  6
     8         10                  6 

My problem I think relates to the fact that the connect by function is implying a hierarchical structure when my data is non directional. Any help would be great, thank you!

Upvotes: 4

Views: 91

Answers (2)

Sentinel
Sentinel

Reputation: 6449

Collapsar has a very lengthy description, but here's a very brief solution that matches your requirements. I'm too far removed from my graph theory days to explain it as he does, but the simple explanation is that since your unique_pairs represents the edges in an undirected graph it can be represented in a directed graph as the union of ID1, ID2 with ID2, ID1 hence my first common table expression. With that directed graph you can compute all connected pairs of nodes as in my second common table expression. Finally you can join your original unique_pairs table to my connected pairs CTE on ID1 and take the minimum root node as your ULTIMATE_PARENT_ID:

with dg1 as (
  select id1, id2 from unique_pairs
  union all
  select id2, id1 from unique_pairs
), cp as (
select id1
     , CONNECT_BY_ROOT id1 root
  from dg1
connect by nocycle prior id1 = id2 
)
select dg1.*
     , min(root) ULTIMATE_PARENT_ID
  from unique_pairs dg1
  join cp
    on cp.id1 = dg1.id1
  group by dg1.id1, dg1.id2
order by 1,2

See the SQL Fiddle

| ID1 | ID2 | ULTIMATE_PARENT_ID |
|-----|-----|--------------------|
|   1 |   2 |                  1 |
|   1 |   3 |                  1 |
|   4 |   2 |                  1 |
|   4 |   5 |                  1 |
|   7 |   6 |                  6 |
|   8 |  10 |                  6 |
|  10 |   7 |                  6 |

Upvotes: 2

collapsar
collapsar

Reputation: 17238

Concept

Your problem basically consists of 3 steps:

  • identify the connected components of an undirected graph G
  • represent each connected component by its vertex with the minimal id
  • map each edge of the graph to the representative of the connected component it belongs to.

The task is solved by:

  • considering an 'equivalent' directed graph D_1 based on the natural numerical ordering of G's vertex ids.
  • reducing D_1 to a bipartite graph B containing all candidate ids by abstracting away chains in the ordering represented by D_1.
  • determining minimal representative ids in B.
  • establishing a mapping from G's edges onto B's edges and thus its weakly connected components (WCCs).

Details

The edge set of G is precisely the set of pairs (id1, id2) given by the records of unique_pairs. The vertex set of G matches the set of values from both id columns of unique_pairs. As G is undirected, it does not matter whether to represent an edge by the record (id1, id2) or (id2, id1); henceforth we assume wlog that id1 < id2 holds for each record (id1, id2) in unique_pairs.

We tackle the task by constructing an auxiliary structure based on the digraph D_1 derived from G by orienting each edge of the undirected graph according to the numerical order of the ids of its incident vertices. Let D_2 be D_1 with all edge orientations reversed. Identifying the connected components of the original graph amounts to identifying the WCCs in either D_1 or D_2. Observe that D_1 is acyclic by construction.

The assisting auxiliary structure is a mapping of the set of edges in G onto the powerset of maximal chains ( = a path not contained in another path ) in D_1, such that all chains in the image of an edge e contain e and maximize the difference between the numerical id of its terminals ( = start and end vertices ). The use of powersets is a technicality as otherwise the mapping might not be well-defined when two maximal chains share the same terminals but follow 'alternative routes' (usually the image will contain a single element).

In the next step we coarsen the mapping: instead of sets of maximal chains the image will be the pair of terminal ids of any maximal chain from the original mapping's image.

Given this mapping, we build an auxiliary bipartite graph B whose edges precisely represent the ordered pairs of terminal ids. The graph is bipartite since no end vertex of a maximal chain can be a start vertex of another maximal chain and vice versa - if this happened, said chains wouldn't have been maximal.

However, by construction, B forms a part of D_1's transitive closure. Therefore, the vertex sets of B's WCCs are subsets of the vertex sets of D_1's WCCs. We are ultimately interested in the vertex with the minimal id in each of the latter WCCs. As no terminal id contained in a maximal chain can be smaller than the start vertex id, all of these vertices must be contained in B's vertex set. So we have to consider B's WCCs only taking the minimum terminal id of each of them. Together with the auxiliary mapping of all G edges onto B edges and thus onto B's WCCs, we solve the original problem.

SQL implementation

In a preprocessing step, G's edges are normalized to the natural ordering of its incident vertices. Assuming that the orientation of the edges is id1 -> id2, test_unique_pairs also represents D_1.

update test_unique_pairs
   set id1 = id2
     , id2 = id1
 where id1 > id2
     ;

The first query maps the edges of D_1 to the edges of B, ie. the pairs of terminal ids of maximal chains in D_1. These chains are acyclic by construction and will be computed by a hierarchical query on the test_unique_pairs and they will be represented by their start and end vertices, these pairs also representing edges in B. This mapping occurs modulo alternative chains sharing both terminals - different maximal paths are considered equivalent if they share both terminals. Hierarchical queries in oracle allow for easy extraction of the maximal elements. The minimal elements, however, are the maximal elements of the hierarchy with parent-child relationships reversed (to cater for this computation is the reason for considering both digraphs D_1, D_2).

In order to ease the second processing stage, the query id wrapped into a view definition:

  create or replace view test_up_chains as
           select *
             from ( -- collect minimal/maximal reachable id for each edge in G
                          select distinct
                                 inter.id1
                               , inter.id2
                               , min(minelt)   chainstart
                               , max(maxelt)   chainend
                            from ( -- collect minimum/maximum terminal id for chains starting/ending with any of G's edges
                                         select distinct
                                                inter_min.id1
                                              , inter_min.id2
                                              , inter_min.minelt
                                              , inter_max.maxelt
                                           from ( -- hierarchical query for maximal chains in D_1
                                                        select distinct
                                                               hup.ID1
                                                             , hup.ID2
                                                             , CONNECT_BY_ROOT hup.id2   maxelt
                                                          from test_unique_pairs hup
                                                    connect by nocycle prior hup.id1 = hup.id2
                                                    start with hup.id2 in (
                                                                 select ref1.id2
                                                                   from test_unique_pairs ref1
                                                                  where not exists (
                                                                           select 1
                                                                             from test_unique_pairs ref2
                                                                            where ref2.id1 = ref1.id2
                                                                        )
                                                               )
                                                 ) inter_max
                                            join ( -- hierarchical query for maximal chains in D_2 ( = D_1 with reversed edge orientations )
                                                        select distinct
                                                               hup.ID1
                                                             , hup.ID2
                                                             , CONNECT_BY_ROOT hup.id1   minelt
                                                          from test_unique_pairs hup
                                                    connect by nocycle prior hup.id2 = hup.id1
                                                    start with hup.id1 in (
                                                                 select ref1.id1
                                                                   from test_unique_pairs ref1
                                                                  where not exists (
                                                                           select 1
                                                                             from test_unique_pairs ref2
                                                                            where ref2.id2 = ref1.id1
                                                                        )
                                                               )
                                                 ) inter_min
                                              on ( inter_min.id1 = inter_max.id1 AND inter_min.id2 = inter_max.id2 )
                                  ) inter
                         group by grouping sets ( (inter.id1, inter.id2) )
                   ) base
          order by base.id1
                 , base.id2
;

Next the edges of B are coalesced into WCCs and their respective minimal ids extracted. The following hierarchical query, lacking a STARTS WITH clause, builds the complete transitive closure wrt the edge incidence relation, and for each edge in B produces root paths to all other edges in the same WCC. Thus minimizing over the start vertex of all hierarchy 'root's given an edge in G finds the minimal reachable id in G. Cycles are suppressed by the NOCYCLE directive.

                 select *
                   from (
                          select distinct
                                 id1
                               , id2
                               , min(chainlim)   chainrep
                            from (
                                         select distinct
                                                id1
                                              , id2
                                              , connect_by_root chainstart chainlim
                                           from test_up_chains
                                     connect by nocycle ( ( prior chainstart = chainstart and prior chainend != chainend ) OR ( prior chainstart != chainstart and prior chainend = chainend ) )
                                 ) base
                          group by grouping sets ( ( base.id1, base.id2 ) )
                        )
               order by id1
                      , id2
     ;    

Upvotes: 1

Related Questions