Reputation:
I have a real question.
I have a database with the schema as follows:
item
tag
item2tag
Basically, each item is tagged as up to 10 things, with varying counts. There are 50,000 items and 50,000 tags, and about 500,000 entries in items2tag. I'd like to find, given one item, the "most similar" item.
By "most similar" I mean the item that has the most similar combination of tags... if something is "cool" twice as much as it is "funny," I want to find all other things that are almost "cool" twice as much as they are "funny." Of course, this should apply to 10 tags, not just 2.
Any ideas?
Upvotes: 1
Views: 847
Reputation: 20621
Given your representation of item-tag relationship as vectors, What you have is an instance of nearest-neighbor search. You may find pointers in the field of Collaborative Filtering.
Upvotes: 0
Reputation: 93565
Well, you can look at linear algebra to give a n dimensional vector to each item, and then compute the distance between items to find the closest items, but that's pretty complex with even small data sets.
Which is why Google came up with Map Reduce. This will probably be your best bet, but even then it's non-trivial.
-Adam
Upvotes: 1