cs0815
cs0815

Reputation: 17388

SQL Server intersect query

I have a database defined as follows:

create table Classes (
   Id INT not null,
   Text NVARCHAR(255) null,
   primary key (Id)
)

create table Documents (
   Id INT not null,
   Title NVARCHAR(MAX) null,
   Abstract NVARCHAR(MAX) null,
   Year INT null,
   primary key (Id)
)

create table Documents_Tokens (
   DocumentFk INT not null,
   TokenFk INT not null
)

create table Documents_Classes (
   DocumentFk INT not null,
   ClassFk INT not null
)

create table Tokens (
   Id INT not null,
   Text NVARCHAR(255) null,
   primary key (Id)
)

There is a m:m relationship between documents and classes and documents and tokens.

I would like to determine certain stats. One stat is A, which measures the co-occurrence of classes and tokens. I currently determine this stat like this:

with combs as
(
    select 
        a.Id as classid, 
        a.text as class,
        b.Id as tokenid, 
        b.text as token
        from dbo.Classes as a
        cross join dbo.Tokens as b
)
,A as
(
    select token, class, count(distinct DocumentFk) as A from
    (
        select
            token,
            class,
            DocumentFk
        from combs
        inner join dbo.Documents_Classes on classid = ClassFk
        group by token, DocumentFk, class
            intersect
        select
            token,
            class,
            DocumentFk
        from combs
        inner join dbo.Documents_Tokens on tokenid = tokenFk
        group by token, DocumentFk, class
    ) T group by token, class
)
...

Unfortunately, this query takes ages (I have added indexes after running the query analyser). Is this the most efficient way to determine A? If not is there a better way? I could also change the underlying database structure to potentially speed things up ...

Any feedback would be very much appreciated.

Upvotes: 1

Views: 677

Answers (2)

RichardTheKiwi
RichardTheKiwi

Reputation: 107706

First of all even though you have made it a CTE, bear in mind that because you have the CTE (combs) twice in the query, the CROSS JOIN is performed TWICE!

The second is that once that becomes clear, you should only carry forward the IDs and at the end join to get the text.

The first part is really

    select
        token,
        class,
        DocumentFk
    from (--> expanded = tokens cross join classes <--)
    inner join dbo.Documents_Classes on classid = ClassFk
    group by token, DocumentFk, class

Which means, there really is no value adding CLASSES to the mix since it is completed covered by Documents_Classes. The first part may as well be written

    select
        token_id,
        class_id,
        DocumentFk
    from TOKENS
    CROSS join dbo.Documents_Classes
    --group by token_id, DocumentFk, class_id

The next thing to note is that the CROSS JOIN by definition renders the GROUP BY useless - they are always unique.

The 2nd part simplifies to

    select
        token_id,
        class_id,
        DocumentFk
    from CLASSES
    CROSS join dbo.Documents_Tokens

Since

  • A CROSS joins all classes (that have documents) to tokens; and
  • B CROSS joins all tokens (that have documents) to classes

The intersection is clearly simply the tokens that exist in documents INNER JOIN classes that exist in documents on the same documentid, e.g.

select C.text class, T.text token, count(DC.DocumentFk) as CountDocument
from Documents_Classes DC
inner join Documents_Tokens DT on DC.DocumentFk = DT.DocumentFk
inner join Classes C on DC.classFk = c.id
inner join Tokens T on DT.tokenFk = t.id
group by C.text, T.text, C.id, T.id

Upvotes: 3

erikkallen
erikkallen

Reputation: 34391

Rewrite it as a single inner join to make life easier for the optimizer

Upvotes: -1

Related Questions