How to match Tagged items based on "similarity"

I have a real question.

I have a database with the schema as follows:

item

id
description
other junk

tag

id
name

item2tag

item_id
tag_id
count

Basically, each item is tagged as up to 10 things, with varying counts. There are 50,000 items and 50,000 tags, and about 500,000 entries in items2tag. I'd like to find, given one item, the "most similar" item.

By "most similar" I mean the item that has the most similar combination of tags... if something is "cool" twice as much as it is "funny," I want to find all other things that are almost "cool" twice as much as they are "funny." Of course, this should apply to 10 tags, not just 2.

Any ideas?

Upvotes: 1

Answers (2)

Yuval F

Reputation: 20621

Given your representation of item-tag relationship as vectors, What you have is an instance of nearest-neighbor search. You may find pointers in the field of Collaborative Filtering.

Upvotes: 0

Adam Davis

Reputation: 93565

Well, you can look at linear algebra to give a n dimensional vector to each item, and then compute the distance between items to find the closest items, but that's pretty complex with even small data sets.

Which is why Google came up with Map Reduce. This will probably be your best bet, but even then it's non-trivial.

-Adam

Upvotes: 1

How to match Tagged items based on &quot;similarity&quot;

Answers (2)

Related Questions

How to match Tagged items based on "similarity"