Reputation: 75
Working with Oracle 11g, I'm trying to find the smallest ID in a undirected graph of ID pairs.
Using the following pairs:
create table unique_pairs ( ID1 INT, ID2 INT );
insert all
into unique_pairs (ID1,ID2) VALUES ( 1, 2 )
into unique_pairs (ID1,ID2) VALUES ( 1, 3 )
into unique_pairs (ID1,ID2) VALUES ( 4, 2 )
into unique_pairs (ID1,ID2) VALUES ( 4, 5 )
into unique_pairs (ID1,ID2) VALUES ( 7, 6 )
into unique_pairs (ID1,ID2) VALUES ( 8, 10 )
into unique_pairs (ID1,ID2) VALUES ( 10, 7 )
select * from dual;
commit;
I'm using a union of two connect by queries to try and travel in two directions of the map:
select distinct a.ID1, a.ID2, min(a.ultimate_parent_id ) as ultimate_parent_id
from
(SELECT distinct ID1, ID2,
connect_by_root(ID2) ultimate_parent_id
FROM unique_pairs
CONNECT BY ID2 = prior ID1 AND ID2 != PRIOR ID2
union
SELECT distinct ID1, ID2,
connect_by_root(ID1) ultimate_parent_id
FROM unique_pairs
CONNECT BY ID1 = prior ID2 AND ID1 != PRIOR ID1) a
group by a.ID1, a.ID2
order by 3;
I get this:
ID1 ID2 ULTIMATE_PARENT_ID
1 2 1
1 3 1
4 2 2
4 5 4
10 7 6
7 6 6
8 10 6
and should be this:
ID1 ID2 ULTIMATE_PARENT_ID
1 2 1
1 3 1
4 2 1
4 5 1
10 7 6
7 6 6
8 10 6
My problem I think relates to the fact that the connect by function is implying a hierarchical structure when my data is non directional. Any help would be great, thank you!
Upvotes: 4
Views: 91
Reputation: 6449
Collapsar has a very lengthy description, but here's a very brief solution that matches your requirements. I'm too far removed from my graph theory days to explain it as he does, but the simple explanation is that since your unique_pairs represents the edges in an undirected graph it can be represented in a directed graph as the union of ID1, ID2 with ID2, ID1 hence my first common table expression. With that directed graph you can compute all connected pairs of nodes as in my second common table expression. Finally you can join your original unique_pairs table to my connected pairs CTE on ID1 and take the minimum root node as your ULTIMATE_PARENT_ID:
with dg1 as (
select id1, id2 from unique_pairs
union all
select id2, id1 from unique_pairs
), cp as (
select id1
, CONNECT_BY_ROOT id1 root
from dg1
connect by nocycle prior id1 = id2
)
select dg1.*
, min(root) ULTIMATE_PARENT_ID
from unique_pairs dg1
join cp
on cp.id1 = dg1.id1
group by dg1.id1, dg1.id2
order by 1,2
See the SQL Fiddle
| ID1 | ID2 | ULTIMATE_PARENT_ID |
|-----|-----|--------------------|
| 1 | 2 | 1 |
| 1 | 3 | 1 |
| 4 | 2 | 1 |
| 4 | 5 | 1 |
| 7 | 6 | 6 |
| 8 | 10 | 6 |
| 10 | 7 | 6 |
Upvotes: 2
Reputation: 17238
Concept
Your problem basically consists of 3 steps:
G
The task is solved by:
D_1
based on the natural numerical ordering of G
's vertex ids.D_1
to a bipartite graph B
containing all candidate ids by abstracting away chains in the ordering represented by D_1
.B
.G
's edges onto B
's edges and thus its weakly connected components (WCCs).Details
The edge set of G
is precisely the set of pairs (id1, id2)
given by the records of unique_pairs
. The vertex set of G
matches the set of values from both id
columns of unique_pairs
. As G
is undirected, it does not matter whether to represent an edge by the record (id1, id2)
or (id2, id1)
; henceforth we assume wlog that id1 < id2
holds for each record (id1, id2)
in unique_pairs
.
We tackle the task by constructing an auxiliary structure based on the digraph D_1
derived from G
by orienting each edge of the undirected graph according to the numerical order of the ids of its incident vertices. Let D_2
be D_1
with all edge orientations reversed. Identifying the connected components of the original graph amounts to identifying the WCCs in either D_1
or D_2
. Observe that D_1
is acyclic by construction.
The assisting auxiliary structure is a mapping of the set of edges in G
onto the powerset of maximal chains ( = a path not contained in another path ) in D_1
, such that all chains in the image of an edge e
contain e
and maximize the difference between the numerical id of its terminals ( = start and end vertices ). The use of powersets is a technicality as otherwise the mapping might not be well-defined when two maximal chains share the same terminals but follow 'alternative routes' (usually the image will contain a single element).
In the next step we coarsen the mapping: instead of sets of maximal chains the image will be the pair of terminal ids of any maximal chain from the original mapping's image.
Given this mapping, we build an auxiliary bipartite graph B
whose edges precisely represent the ordered pairs of terminal ids. The graph is bipartite since no end vertex of a maximal chain can be a start vertex of another maximal chain and vice versa - if this happened, said chains wouldn't have been maximal.
However, by construction, B
forms a part of D_1
's transitive closure. Therefore, the vertex sets of B
's WCCs are subsets of the vertex sets of D_1
's WCCs. We are ultimately interested in the vertex with the minimal id in each of the latter WCCs. As no terminal id contained in a maximal chain can be smaller than the start vertex id, all of these vertices must be contained in B
's vertex set. So we have to consider B
's WCCs only taking the minimum terminal id of each of them. Together with the auxiliary mapping of all G
edges onto B
edges and thus onto B
's WCCs, we solve the original problem.
SQL implementation
In a preprocessing step, G
's edges are normalized to the natural ordering of its incident vertices. Assuming that the orientation of the edges is id1 -> id2
, test_unique_pairs
also represents D_1
.
update test_unique_pairs
set id1 = id2
, id2 = id1
where id1 > id2
;
The first query maps the edges of D_1
to the edges of B
, ie. the pairs of terminal ids of maximal chains in D_1
. These chains are acyclic by construction and will be computed by a hierarchical query on the test_unique_pairs
and they will be represented by their start and end vertices, these pairs also representing edges in B
. This mapping occurs modulo alternative chains sharing both terminals - different maximal paths are considered equivalent if they share both terminals. Hierarchical queries in oracle allow for easy extraction of the maximal elements. The minimal elements, however, are the maximal elements of the hierarchy with parent-child relationships reversed (to cater for this computation is the reason for considering both digraphs D_1
, D_2
).
In order to ease the second processing stage, the query id wrapped into a view definition:
create or replace view test_up_chains as
select *
from ( -- collect minimal/maximal reachable id for each edge in G
select distinct
inter.id1
, inter.id2
, min(minelt) chainstart
, max(maxelt) chainend
from ( -- collect minimum/maximum terminal id for chains starting/ending with any of G's edges
select distinct
inter_min.id1
, inter_min.id2
, inter_min.minelt
, inter_max.maxelt
from ( -- hierarchical query for maximal chains in D_1
select distinct
hup.ID1
, hup.ID2
, CONNECT_BY_ROOT hup.id2 maxelt
from test_unique_pairs hup
connect by nocycle prior hup.id1 = hup.id2
start with hup.id2 in (
select ref1.id2
from test_unique_pairs ref1
where not exists (
select 1
from test_unique_pairs ref2
where ref2.id1 = ref1.id2
)
)
) inter_max
join ( -- hierarchical query for maximal chains in D_2 ( = D_1 with reversed edge orientations )
select distinct
hup.ID1
, hup.ID2
, CONNECT_BY_ROOT hup.id1 minelt
from test_unique_pairs hup
connect by nocycle prior hup.id2 = hup.id1
start with hup.id1 in (
select ref1.id1
from test_unique_pairs ref1
where not exists (
select 1
from test_unique_pairs ref2
where ref2.id2 = ref1.id1
)
)
) inter_min
on ( inter_min.id1 = inter_max.id1 AND inter_min.id2 = inter_max.id2 )
) inter
group by grouping sets ( (inter.id1, inter.id2) )
) base
order by base.id1
, base.id2
;
Next the edges of B
are coalesced into WCCs and their respective minimal ids extracted. The following hierarchical query, lacking a STARTS WITH
clause, builds the complete transitive closure wrt the edge incidence relation, and for each edge in B
produces root paths to all other edges in the same WCC. Thus minimizing over the start vertex of all hierarchy 'root's given an edge in G
finds the minimal reachable id in G
. Cycles are suppressed by the NOCYCLE
directive.
select *
from (
select distinct
id1
, id2
, min(chainlim) chainrep
from (
select distinct
id1
, id2
, connect_by_root chainstart chainlim
from test_up_chains
connect by nocycle ( ( prior chainstart = chainstart and prior chainend != chainend ) OR ( prior chainstart != chainstart and prior chainend = chainend ) )
) base
group by grouping sets ( ( base.id1, base.id2 ) )
)
order by id1
, id2
;
Upvotes: 1