Clare Nia
Clare Nia

Reputation: 105

Efficiently querying a directed/undirected table of graph edges in SQL Server

I've got a SQL server table in which each row represents an edge in a graph network. The FromNodeID and ToNodeID are foreign keys to a node table, and the schema is something like this:

CREATE TABLE #Edges (
  EdgeID int identity (1,1),
  FromNodeID int,
  ToNodeID int
  );

INSERT INTO #Edges (FromNodeID, ToNodeID) VALUES
  (1,2),
  (1,3),
  (1,4),
  (2,3),
  (3,5),
  (4,5),
  (5,6);

Now, if I consider each edge to be directed (i.e., one way), then it's easy to work out all those nodes that I can get to directly from any node. I'd add an index to the FromNodeID column, and then run a query like this:

SELECT ToNodeID FROM #Edges WHERE FromNodeID = 3

Result: 5

But what would be the best way to structure my table/query if I want to treat each edge as unidirectional. i.e. starting from node 3, I'd like to get the results:

Result: 1, 2, 5

The simplest way I can think of would be to add an additional index to the ToNodeID column and then run a query like this:

SELECT ToNodeID FROM #Edges WHERE FromNodeID = 3 
UNION SELECT FromNodeID FROM #Edges WHERE ToNodeID = 3;

But this obviously involves combining result sets from two queries and doesn't seem very efficient - is there a better way to write this in a single query? (Note that I don't want to insert the reversed edges again into the table - I need to be able to treat the edges as either directed or undirected at runtime).

Thanks for any advice!

Upvotes: 6

Views: 4575

Answers (3)

Quassnoi
Quassnoi

Reputation: 425863

But this obviously involves combining result sets from two queries and doesn't seem very efficient - is there a better way to write this in a single query?

This is efficient enough.

You could do this:

SELECT  CASE 3 WHEN FromNodeId THEN ToNodeId ELSE FromNodeId END
FROM    Edges
WHERE   3 IN (FromNodeId, ToNodeId)

but this will be essentially the same (will UNION these indexes under the hood).

Here's a script to test:

CREATE TABLE #Edges
        (
        EdgeID INT IDENTITY (1,1) PRIMARY KEY,
        FromNodeID int NOT NULL,
        ToNodeID int NOT NULL
        )
CREATE INDEX ix_edges_from ON #Edges (FromNodeID, ToNodeId)
CREATE INDEX ix_edges_to ON #Edges (ToNodeID, FromNodeId)
;
WITH    q (rn) AS
        (
        SELECT  1
        UNION ALL
        SELECT  rn + 1
        FROM    q
        WHERE   rn < 1000
        )
INSERT
INTO    #Edges (FromNodeId, ToNodeId)
SELECT  q1.rn, q2.rn
FROM    q q1
CROSS JOIN
        q q2
WHERE   (q1.rn + q2.rn) % 37 = 0
OPTION (MAXRECURSION 0)

For the UNION:

SELECT  ToNodeId
FROM    #Edges
WHERE   FromNodeId = 3
UNION
SELECT  FromNodeId
FROM    #Edges
WHERE   ToNodeId = 3


  |--Stream Aggregate(GROUP BY:([Union1006]))
       |--Merge Join(Concatenation)
            |--Index Seek(OBJECT:([tempdb].[dbo].[#Edges]), SEEK:([tempdb].[dbo].[#Edges].[FromNodeID]=(3)) ORDERED FORWARD)
            |--Index Seek(OBJECT:([tempdb].[dbo].[#Edges]), SEEK:([tempdb].[dbo].[#Edges].[ToNodeID]=(3)) ORDERED FORWARD)

For the IN:

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN (3)=[tempdb].[dbo].[#Edges].[FromNodeID] THEN [tempdb].[dbo].[#Edges].[ToNodeID] ELSE [tempdb].[dbo].[#Edges].[FromNodeID] END))
       |--Sort(DISTINCT ORDER BY:([tempdb].[dbo].[#Edges].[EdgeID] ASC))
            |--Concatenation
                 |--Index Seek(OBJECT:([tempdb].[dbo].[#Edges]), SEEK:([tempdb].[dbo].[#Edges].[FromNodeID]=(3)) ORDERED FORWARD)
                 |--Index Seek(OBJECT:([tempdb].[dbo].[#Edges]), SEEK:([tempdb].[dbo].[#Edges].[ToNodeID]=(3)) ORDERED FORWARD)

As you can see, the plans are essentially the same: they both take values from the corresponding indexes and concatenate the results.

The UNION query is in fact a little more efficient, since it uses a Merge Join to concatenate the results, and the records come out of the merge join naturally ordered, so the Stream Aggregate does not need to sort.

Upvotes: 5

outis
outis

Reputation: 77450

There are three options I can think of: do it only in the table, only in the queries, or create a view. For the table, create triggers that enforce the symmetric closure (e.g. when inserting (a,b), also insert (b,a); when updating (a,b) to (c,d), delete the old symmetry-preserving (b,a) pair, then insert (d,c)). Note that this may not work, as some RDBMSs (I'm not sure if SQL Server is one) don't allow insertions/updates to the table that a trigger fired on.

In queries,

SELECT CASE FromNodeID WHEN 3 THEN ToNodeId ELSE FromNodeId END
  FROM #Edges 
    WHERE FromNodeID=3 OR ToNodeID=3

For the view, create one that's the symmetric closure of the original table. I think you'd still have to use a UNION, but it could simplify query writing.

CREATE VIEW undirected_edges (FirstEdgeID,SecondEdgeId)
  AS (SELECT FromNodeID, ToNodeID FROM #Edges)
  UNION DISTINCT
    (SELECT ToNodeID, FromNodeID FROM #Edges)

Upvotes: 0

Kevin A. Naud&#233;
Kevin A. Naud&#233;

Reputation: 4080

Must you process the graph directly from SQL Server? If you are really concerned about performance, you should use one of the data-structures specifically for representing and processing graphs. Most of the work I have done with graphs (and I have done plenty) would have been infeasible if I used a generic database backend to consult the graphs.

One of the most effective representations that I have used is described in the appendices of a compiler book I have: Engineering a Compiler, by Keith Cooper and Linda Torczon.

Upvotes: 1

Related Questions