Reputation: 17388
I have a database defined as follows:
create table Classes (
Id INT not null,
Text NVARCHAR(255) null,
primary key (Id)
)
create table Documents (
Id INT not null,
Title NVARCHAR(MAX) null,
Abstract NVARCHAR(MAX) null,
Year INT null,
primary key (Id)
)
create table Documents_Tokens (
DocumentFk INT not null,
TokenFk INT not null
)
create table Documents_Classes (
DocumentFk INT not null,
ClassFk INT not null
)
create table Tokens (
Id INT not null,
Text NVARCHAR(255) null,
primary key (Id)
)
There is a m:m relationship between documents and classes and documents and tokens.
I would like to determine certain stats. One stat is A, which measures the co-occurrence of classes and tokens. I currently determine this stat like this:
with combs as
(
select
a.Id as classid,
a.text as class,
b.Id as tokenid,
b.text as token
from dbo.Classes as a
cross join dbo.Tokens as b
)
,A as
(
select token, class, count(distinct DocumentFk) as A from
(
select
token,
class,
DocumentFk
from combs
inner join dbo.Documents_Classes on classid = ClassFk
group by token, DocumentFk, class
intersect
select
token,
class,
DocumentFk
from combs
inner join dbo.Documents_Tokens on tokenid = tokenFk
group by token, DocumentFk, class
) T group by token, class
)
...
Unfortunately, this query takes ages (I have added indexes after running the query analyser). Is this the most efficient way to determine A? If not is there a better way? I could also change the underlying database structure to potentially speed things up ...
Any feedback would be very much appreciated.
Upvotes: 1
Views: 677
Reputation: 107706
First of all even though you have made it a CTE, bear in mind that because you have the CTE (combs) twice in the query, the CROSS JOIN is performed TWICE!
The second is that once that becomes clear, you should only carry forward the IDs and at the end join to get the text.
The first part is really
select
token,
class,
DocumentFk
from (--> expanded = tokens cross join classes <--)
inner join dbo.Documents_Classes on classid = ClassFk
group by token, DocumentFk, class
Which means, there really is no value adding CLASSES to the mix since it is completed covered by Documents_Classes. The first part may as well be written
select
token_id,
class_id,
DocumentFk
from TOKENS
CROSS join dbo.Documents_Classes
--group by token_id, DocumentFk, class_id
The next thing to note is that the CROSS JOIN by definition renders the GROUP BY useless - they are always unique.
The 2nd part simplifies to
select
token_id,
class_id,
DocumentFk
from CLASSES
CROSS join dbo.Documents_Tokens
Since
The intersection is clearly simply the tokens that exist in documents INNER JOIN classes that exist in documents on the same documentid, e.g.
select C.text class, T.text token, count(DC.DocumentFk) as CountDocument
from Documents_Classes DC
inner join Documents_Tokens DT on DC.DocumentFk = DT.DocumentFk
inner join Classes C on DC.classFk = c.id
inner join Tokens T on DT.tokenFk = t.id
group by C.text, T.text, C.id, T.id
Upvotes: 3
Reputation: 34391
Rewrite it as a single inner join to make life easier for the optimizer
Upvotes: -1