Reputation: 9613
I have this joining table:
CREATE TABLE [TagMap](
[intItemId] [bigint] NOT NULL,
[intTagId] [bigint] NOT NULL,
CONSTRAINT [PK_TagMap_intItemId] PRIMARY KEY CLUSTERED
(
[intItemId] ASC,
[intTagId] ASC
))
My sproc discovers the tags that are associated with a seedItemId and then selects other intItemIds that are associated with those tags, like this:
declare @baseTags table (intTagId bigint primary key clustered);
INSERT INTO @baseTags (intTagId)
SELECT TOP 20 t1.intTagId
FROM TagMap t1
WHERE t1.intItemId = 776
SELECT TOP 500 t1.intItemId
FROM TagMap t1
JOIN @baseTags t2
ON t1.intTagId = t2.intTagId
GROUP BY t1.intItemId
ORDER BY Count(*) DESC
What I need to to is specify a minimum number of tag associations - say 10 - between two intItemIds below which a value is not returned. In other words in the TagMap table can find ten or more intTagIds that two intItemIds have in common, it's good and we select it - otherwise ignore it.
So, for example given this data:
CREATE TABLE #TagMap(
[intItemId] [bigint] NOT NULL,
[intTagId] [bigint] NOT NULL,
CONSTRAINT [PK_TagMap_intItemId] PRIMARY KEY CLUSTERED
(
[intItemId] ASC,
[intTagId] ASC
))
insert into #TagMap
(intItemId, intTagId)
values
(1, 100),(1, 200),(1, 300),
(2, 100),(2, 200), (2, 500),(2, 600),
(3, 100), (3, 500),(3, 600)
Assume the match threshold is two.
If the seed is intItemId 1, then only intItemId 2 should be returned (it has two matching tag Ids: 100 and 200, while intItemId 3 has one: 100, which is below the threshold).
If the seed is intItemId 2, then both intItemId 1 and 3 should be returned (intItemId 1 matches the tag Ids 100 and 200, while intItemId 3 matches the tag Ids 500 and 600).
If the seed is intItemId 3, then only intItemId 2 should be returned (it has two matching tag Ids: 500 and 600, while intItemId has one: 100 which is below the threshold).
Any ideas how to do this, please?
Cheers, Matt
Upvotes: 2
Views: 164
Reputation: 239764
This seems to do it (vary @InterestItem
to 1,2,3, and it seems to generate the result you've asked for):
declare @InterestItem int
set @InterestItem = 1
declare @Threshold int
set @Threshold = 2
select
tm2.intItemId,
COUNT(*)
from
#TagMap tm1
inner join
#TagMap tm2
on
tm1.intTagId = tm2.intTagId and
tm2.intItemId <> tm1.intItemId
where
tm1.intItemId = @InterestItem
group by
tm2.intItemId
having
COUNT(*) >= @Threshold
I'm currently including the COUNT
in the result set, but that isn't needed to make this work.
Upvotes: 1
Reputation: 86775
SELECT
[foreign].intItemID
FROM
#TagMap AS [primary]
INNER JOIN
#TagMap AS [foreign]
ON [foreign].intTagID = [primary].intTagID
WHERE
[primary].intItemID = 1
GROUP BY
[foreign].intItemID
HAVING
COUNT(distinct [foreign].intTagID) >= @threshold
Note, however, that this scales pretty badly; because the JOIN
searches for "has any of these tags", and then only in the HAVING
clause can you specify "has all of these tags".
From my experience you can make minor optimisations, but the most rewarding approach I found was to cache the results into a more standard mapping table. Updating them as and when necessary. (The contents of the tags data changing very infrequently.)
Upvotes: 2