Bob Tway
Bob Tway

Reputation: 9613

Specifiy minimum number of join points in SQL

I have this joining table:

 CREATE TABLE [TagMap](
[intItemId] [bigint] NOT NULL,
[intTagId] [bigint] NOT NULL,
 CONSTRAINT [PK_TagMap_intItemId] PRIMARY KEY CLUSTERED 
 (
[intItemId] ASC,
[intTagId] ASC
 ))

My sproc discovers the tags that are associated with a seedItemId and then selects other intItemIds that are associated with those tags, like this:

 declare @baseTags table    (intTagId bigint primary key clustered);

INSERT INTO @baseTags (intTagId)
SELECT TOP 20 t1.intTagId
FROM   TagMap t1
WHERE t1.intItemId = 776

 SELECT TOP 500 t1.intItemId
 FROM   TagMap t1
      JOIN @baseTags t2
       ON t1.intTagId = t2.intTagId
 GROUP  BY t1.intItemId
 ORDER  BY Count(*) DESC

What I need to to is specify a minimum number of tag associations - say 10 - between two intItemIds below which a value is not returned. In other words in the TagMap table can find ten or more intTagIds that two intItemIds have in common, it's good and we select it - otherwise ignore it.

So, for example given this data:

 CREATE TABLE #TagMap(
 [intItemId] [bigint] NOT NULL,
 [intTagId] [bigint] NOT NULL,
 CONSTRAINT [PK_TagMap_intItemId] PRIMARY KEY CLUSTERED 
 (
 [intItemId] ASC,
 [intTagId] ASC
 ))

 insert into #TagMap
 (intItemId, intTagId)
 values 
 (1, 100),(1, 200),(1, 300),
 (2, 100),(2, 200),         (2, 500),(2, 600),
 (3, 100),                  (3, 500),(3, 600)

Assume the match threshold is two.

If the seed is intItemId 1, then only intItemId 2 should be returned (it has two matching tag Ids: 100 and 200, while intItemId 3 has one: 100, which is below the threshold).

If the seed is intItemId 2, then both intItemId 1 and 3 should be returned (intItemId 1 matches the tag Ids 100 and 200, while intItemId 3 matches the tag Ids 500 and 600).

If the seed is intItemId 3, then only intItemId 2 should be returned (it has two matching tag Ids: 500 and 600, while intItemId has one: 100 which is below the threshold).

Any ideas how to do this, please?

Cheers, Matt

Upvotes: 2

Views: 164

Answers (2)

Damien_The_Unbeliever
Damien_The_Unbeliever

Reputation: 239764

This seems to do it (vary @InterestItem to 1,2,3, and it seems to generate the result you've asked for):

declare @InterestItem int
set @InterestItem = 1
declare @Threshold int
set @Threshold = 2

select
    tm2.intItemId,
    COUNT(*)
from
    #TagMap tm1
        inner join
    #TagMap tm2
        on
            tm1.intTagId = tm2.intTagId and
            tm2.intItemId <> tm1.intItemId
where
    tm1.intItemId = @InterestItem
group by
    tm2.intItemId
having
    COUNT(*) >= @Threshold

I'm currently including the COUNT in the result set, but that isn't needed to make this work.

Upvotes: 1

MatBailie
MatBailie

Reputation: 86775

SELECT
  [foreign].intItemID
FROM
  #TagMap    AS [primary]
INNER JOIN
  #TagMap    AS [foreign]
    ON [foreign].intTagID = [primary].intTagID
WHERE
  [primary].intItemID = 1
GROUP BY
  [foreign].intItemID
HAVING
  COUNT(distinct [foreign].intTagID) >= @threshold

Note, however, that this scales pretty badly; because the JOIN searches for "has any of these tags", and then only in the HAVING clause can you specify "has all of these tags".

From my experience you can make minor optimisations, but the most rewarding approach I found was to cache the results into a more standard mapping table. Updating them as and when necessary. (The contents of the tags data changing very infrequently.)

Upvotes: 2

Related Questions