compare data sets and return best match

Question

In mysql, I am using "join tables" to assign tags to items. I would like to see which items have the most similar tags to the item being viewed.

For example, let's say the item of interest has been tagged "cool", "cars" and "red". I would like to search for other items with those tags. I want to see items that have been tagged "cars", but I want items that were tagged "cars" and "red" to be above the item that was only tagged "cars". I want items with identical tags to be at the top of the results.

Is there some way to compare a data set (subquery) with another data set (subquery) using IN? Alternately, is there some trick I can use to evaluate them as comma separated lists using GROUP BY and GROUP_CONCAT()?

Wouter van Nifterick · Accepted Answer

It would help if you show us your table structures, so I can be more specific.

I'm assuming you've got a structure that resembles this:

Table item: (id, itemname)
1 item1
2 item2
3 item3
4 item4
5 item5

Table tag: (id, tagname)
1 cool
2 red
3 car

Table itemtag: (id, itemid, tagid)
1 1 2 (=item1, red)
2 2 1 (=item2, cool)
3 2 3 (=item2, car)
4 3 1 (=item3, cool)
5 3 2 (=item3, red)
6 3 3 (=item3, car)
7 4 3 (=item3, car)
8 5 3 (=item3, car)

In general my approach would be to start out by counting each separate tag.

-- make a list of how often a tag was used:
select tagid, count(*) as `tagscore` from itemtag group by tagid

This shows a row for each tag that was assigned to the item, with a score.

In our example, that would be:

tag  tagscore
1    2         (cool, 2x)
2    2         (red, 2x)
3    4         (car, 4x)


set @ItemOfInterest=2;

select
  itemname,
  sum(tagscore) as `totaltagscore`,
  GROUP_CONCAT(tags) as `tags`
from
  itemtag
join item on itemtag.itemid=item.id

join
  /* join the query from above (scores per tag) */
  (select tagid, count(*) as `tagscore` from itemtag group by tagid ) as `TagScores`
  on `TagScores`.tagid=itemtag.tagid
where
  itemid<>@ItemOfInterest and 
  /* get the taglist of the current item */
  tagid in (select distinct tagid from itemtag where itemid=@ItemOfInterest)
group by
  itemid
order by
  2 desc

Explanation: The query has 2 subqueries: One is to obtain the list tags from the item of interest. We only want to work with those. The other subquery generates a list of scores per tag.

So in the end, each item in the database has a list of tag scores. Those scores are added up with sum(tagscore), and that number is used to order the result (highest scores on top).

To show a list of available tags, I've used GROUP_CONCAT.

The query will result in something like this (I've made the actual data up here):

Item   TagsScore   Tags
item3  15          red,cool,car
item4   7          red,car
item5   7          red
item1   5          car
item6   5          car

compare data sets and return best match

Answers (2)

Related Questions