laurent

Reputation: 90804

What is the theory to select one item based on various criteria?

I need to solve a problem where an item A must be compared to thousands of other items to find out which of them are the most similar to item A.

I want to assign a weight to each of these items depending on how similar it is to item A. Various criteria will determine the final weight. For instance, if item1.someProperty == otherItem.someProperty, then I increase the weight by 5; if item1.anotherProperty == otherItem.anotherProperty, then I only increase the weight by 1, because someProperty is more important than anotherProperty.

The reason I'm describing all that is that I want to know if there's any theory that will help me create this system. In particular, how to choose the weight of each criterion, how to compute the final weight of an item, and how to architect all that.

So does anybody know if there is any theory that could help? Or perhaps there is a better way to do what I'm trying to do?

Upvotes: 4

Views: 109

Answers (3)

wildplasser

Reputation: 44250

You could think of your properties as dimensions and compose a distance out of them. If there is correlation between the properties, you could take that into account as well (google for Mahalanobis distance).

But basically it boils down to

 float distance(a, b) {
    return w1 * ABS(a.x - b.x)
         + w2 * ABS(a.y - b.y)
           ...
    ;
 } 

Instead of summing the absolute differences, you could sum the squared differences (to penalise big differences); anything goes.
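To make the idea concrete, here is a minimal Python sketch of the weighted sum, with a `power` argument to switch between absolute and squared differences. The function name `weighted_distance`, the dict-based items, and the weights 5 and 1 are assumptions made for illustration only.

```python
def weighted_distance(a, b, weights, power=1):
    """Sum of w * |a[k] - b[k]| ** power over the properties listed in `weights`."""
    return sum(w * abs(a[k] - b[k]) ** power for k, w in weights.items())


# Hypothetical items with two numeric properties.
a = {"x": 1.0, "y": 4.0}
b = {"x": 3.0, "y": 1.0}
weights = {"x": 5.0, "y": 1.0}  # "x" matters five times as much as "y"

linear = weighted_distance(a, b, weights)            # 5*2 + 1*3
squared = weighted_distance(a, b, weights, power=2)  # 5*4 + 1*9
```

With `power=2` a single large difference dominates the total, which is the penalising effect mentioned above.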

BTW for nominal data you could use some entropy-based measure of difference.
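One possible reading of "entropy-based", sketched below under my own assumptions: weight a match on a nominal value by its self-information, -log2 of its relative frequency, so that agreeing on a rare value counts for more than agreeing on a common one. The helper names and the `color` data are hypothetical, and this is only one of several measures of this kind.

```python
import math
from collections import Counter


def information_weights(items, prop):
    """Map each value of `prop` to -log2 of its relative frequency in `items`."""
    counts = Counter(item[prop] for item in items)
    total = len(items)
    return {value: -math.log2(n / total) for value, n in counts.items()}


def nominal_similarity(a, b, prop, info):
    """Score a match by the information content of the shared value, else 0."""
    return info[a[prop]] if a[prop] == b[prop] else 0.0


# Hypothetical data: "red" is common, "blue" is rare.
items = [{"color": "red"}, {"color": "red"}, {"color": "red"}, {"color": "blue"}]
info = information_weights(items, "color")

common_match = nominal_similarity(items[0], items[1], "color", info)
mismatch = nominal_similarity(items[0], items[3], "color", info)
```

A difference measure can be derived from this by subtracting the score from its maximum possible value.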

Upvotes: 2

gusbro

Reputation: 22585

You could read any book related to Machine Learning, for example this one. The KNN (k-nearest neighbors) algorithm addresses your problem. You basically have to define a distance measure over your problem and then compare those distances.
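The neighbor-search step of KNN can be sketched in a few lines of Python. This is a brute-force version that assumes items are numeric tuples and uses a Manhattan distance; both choices, and the names `k_nearest` and `manhattan`, are assumptions for the example. Any distance function with the same signature could be plugged in.

```python
import heapq


def manhattan(a, b):
    """Sum of absolute coordinate differences between two equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_nearest(target, candidates, k, distance):
    """Return the k candidates with the smallest distance to `target`."""
    return heapq.nsmallest(k, candidates, key=lambda c: distance(target, c))


# Hypothetical data: find the two items closest to the target.
target = (0, 0)
candidates = [(5, 5), (1, 0), (0, 2), (10, 1)]
nearest = k_nearest(target, candidates, k=2, distance=manhattan)
```

For a few thousand items this linear scan is usually fast enough; index structures only become interesting with much larger collections.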

Upvotes: 2

Fred Foo

Reputation: 363757

This is at least superficially similar to the vector space model (VSM) of information retrieval (IR). That's usually based on bags-of-words, but it could be adapted to other data representations.

The weights you describe would correspond to what is called "field boosting" in VSM IR.

But see also nearest neighbor search.
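As a rough illustration of the boosting idea applied to the scheme in the question, here is a minimal Python sketch that adds a per-property boost whenever two items agree on that property, then ranks the candidates by total score. The boosts 5 and 1 come from the question; the item values, the `name` field, and the function name `match_score` are made up for the example.

```python
def match_score(item, other, boosts):
    """Add a property's boost to the score whenever both items agree on it."""
    return sum(boost for prop, boost in boosts.items()
               if item[prop] == other[prop])


boosts = {"someProperty": 5, "anotherProperty": 1}

# Hypothetical item A and candidates to compare it against.
item_a = {"someProperty": "x", "anotherProperty": "y"}
others = [
    {"name": "p", "someProperty": "x", "anotherProperty": "z"},  # matches someProperty only
    {"name": "q", "someProperty": "w", "anotherProperty": "y"},  # matches anotherProperty only
    {"name": "r", "someProperty": "x", "anotherProperty": "y"},  # matches both
]

# Highest score first.
ranked = sorted(others, key=lambda o: match_score(item_a, o, boosts), reverse=True)
ranked_names = [o["name"] for o in ranked]
```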

Upvotes: 2
