raghu

Reputation: 379

Using match scores to determine right features (Machine Learning)

I am familiar with determining how well a set of documents in our knowledge base matches a search-query document (based on cosine distance) once the features are available: we map both onto the vector space defined by the features.

How do I handle the reverse? I have been given a set of documents and their match scores against multiple query documents, and I have to determine the features (or decision criteria) that determine a match. This would be the training data, and the model would be used to identify matches against our knowledge base for new search queries.

Our current approach is to think up a set of features and see which combinations get the best match scores on the training set, but we end up trying many combinations. Is there a better way to do this?

Upvotes: 0

Views: 558

Answers (1)

greeness

Reputation: 16104

Here is a simple and straightforward approach (a linear model) that should work. If you are working with documents and queries, the features you are using are probably tokens (words), n-grams, or topics. Let's assume the features are words for simplicity.

Suppose you have a query document:

apple iphone6 

and you have a set of documents with their corresponding match scores against that query (assume each document is the content of a URL):

www.apple.com (Apple - iPhone 6) score: 0.8
www.cnet.com/products/apple-iphone-6 (Apple iPhone 6 review), score: 0.75
www.stuff.tv/apple/apple-iphone-6/review (Apple iPhone 6 review), score: 0.7
....

Per-query model

First, you need to extract word features from the matching URLs. Suppose we get the following words with their L1-normalized TF-IDF scores:

www.apple.com
apple 0.5
iphone 0.4
ios8 0.1

www.cnet.com/products/apple-iphone-6
apple 0.4
iphone 0.2
review 0.2
cnet 0.2

www.stuff.tv/apple/apple-iphone-6/review
apple 0.4
iphone 0.4
review 0.1
cnet 0.05
stuff 0.05
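The per-document scores above are L1-normalized, i.e. each document's TF-IDF values sum to 1. A minimal sketch of that normalization step (the raw TF-IDF values below are hypothetical, chosen only so the result matches the www.apple.com example):

```python
def l1_normalize(tfidf):
    """Scale a document's raw tf-idf scores so they sum to 1."""
    total = sum(tfidf.values())
    return {word: score / total for word, score in tfidf.items()}

# Hypothetical raw tf-idf scores for www.apple.com before normalization.
raw = {"apple": 2.5, "iphone": 2.0, "ios8": 0.5}
normalized = l1_normalize(raw)  # {'apple': 0.5, 'iphone': 0.4, 'ios8': 0.1}
```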

Second, you can combine the feature scores with the match scores and aggregate on a per-feature basis (each weight is the sum, over documents, of the feature's TF-IDF score times the document's match score):

w(apple) = 0.5 * 0.8 + 0.4 * 0.75 + 0.4 * 0.7 = 0.98
w(iphone) = 0.4 * 0.8 + 0.2 * 0.75 + 0.4 * 0.7 = 0.75
w(ios8) = 0.1 * 0.8 = 0.08
w(review) = 0.2 * 0.75 + 0.1 * 0.7 = 0.22
w(cnet) = 0.2 * 0.75 + 0.05 * 0.7 = 0.185
w(stuff) = 0.05 * 0.7 = 0.035

You might also want a normalization step that divides each w by the number of matching documents. You now have the features, ordered by relevancy in descending order:

w(apple) = 0.98 / 3
w(iphone) = 0.75 / 3
w(review) = 0.22 / 3
w(cnet) = 0.185 / 3
w(ios8) = 0.08 / 3
w(stuff) = 0.035 / 3
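The aggregation and normalization steps above can be sketched in a few lines of Python, using the example documents and match scores as input:

```python
from collections import defaultdict

# L1-normalized tf-idf features and match scores from the example above.
doc_features = {
    "www.apple.com": {"apple": 0.5, "iphone": 0.4, "ios8": 0.1},
    "www.cnet.com/products/apple-iphone-6":
        {"apple": 0.4, "iphone": 0.2, "review": 0.2, "cnet": 0.2},
    "www.stuff.tv/apple/apple-iphone-6/review":
        {"apple": 0.4, "iphone": 0.4, "review": 0.1, "cnet": 0.05, "stuff": 0.05},
}
match_scores = {
    "www.apple.com": 0.8,
    "www.cnet.com/products/apple-iphone-6": 0.75,
    "www.stuff.tv/apple/apple-iphone-6/review": 0.7,
}

def per_query_weights(doc_features, match_scores):
    """w(word) = sum over docs of tfidf(word, doc) * match_score(doc),
    divided by the number of matching documents."""
    w = defaultdict(float)
    for doc, features in doc_features.items():
        for word, tfidf in features.items():
            w[word] += tfidf * match_scores[doc]
    n_docs = len(doc_features)
    return {word: total / n_docs for word, total in w.items()}

weights = per_query_weights(doc_features, match_scores)
```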

Using those weights, you even get a linear scoring model:

score = w(apple) * tf-idf(apple) + w(iphone) * tf-idf(iphone) + ... + w(stuff) * tf-idf(stuff)

Suppose now you have a new url with those features detected:

ios8: 0.5
cnet: 0.3
iphone:0.2

You can then calculate its match score against the query "apple iphone6":

score = w(ios8)*0.5 + w(cnet)*0.3 + w(iphone)*0.2
      = (0.08*0.5 + 0.185*0.3 + 0.75*0.2) / 3
      ≈ 0.082

The match score can then be used to rank documents by their relevancy to the same query.
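Scoring a new document is just a dot product between the learned per-query weights and the document's features. A sketch, with the weight values taken from the worked example above:

```python
def score_document(weights, doc_features):
    """Dot product of per-query feature weights with a new document's
    tf-idf features; words absent from the model contribute 0."""
    return sum(weights.get(word, 0.0) * value
               for word, value in doc_features.items())

# Normalized weights learned for the query "apple iphone6".
weights = {"apple": 0.98 / 3, "iphone": 0.75 / 3, "review": 0.22 / 3,
           "cnet": 0.185 / 3, "ios8": 0.08 / 3, "stuff": 0.035 / 3}
new_doc = {"ios8": 0.5, "cnet": 0.3, "iphone": 0.2}
score = score_document(weights, new_doc)  # ≈ 0.082
```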

Any-query model

You perform the same procedure to construct a linear model for each query. If your training data contains k such queries with their matching documents, you end up with k models, each constructed from one query:

model(apple iphone6) = (0.98*apple + 0.75*iphone + 0.22*review + ...) / 3
model(android apps) = (0.77*google + 0.5*android + ...) / 5
model(samsung phone) = (0.5*samsung + 0.2*galaxy + ...) / 10

Note that in the example models above, 3, 5, and 10 are the normalizers (the total number of documents matched to each query).

Now suppose a new query comes in:

samsung android release

Our tasks left are to:

  • find relevant queries q1, q2, ..., qm
  • use query models to score new documents and aggregate.

You first need to extract features from this query; also suppose you have already cached the features of each query you have learned. Using any nearest-neighbor approach (e.g., locality-sensitive hashing), you can find the top-k queries most similar to "samsung android release"; they would probably be:

similarity(samsung phone, samsung android release) = 0.2
similarity(android apps, samsung android release) = 0.2
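LSH is the scalable option; with only a handful of cached queries, brute-force cosine similarity over the query feature vectors is enough for a sketch (the bag-of-words feature values below are hypothetical):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse feature dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical bag-of-words features for the new and the cached queries.
new_query = {"samsung": 1.0, "android": 1.0, "release": 1.0}
cached = {"samsung phone": {"samsung": 1.0, "phone": 1.0},
          "android apps": {"android": 1.0, "apps": 1.0}}
sims = {q: cosine(feats, new_query) for q, feats in cached.items()}
```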

Overall Ranker

Thus we get our final ranker as:

0.2*model(samsung phone) + 0.2*model(android apps) =
  0.2 * (0.5*samsung + 0.2*galaxy + ...) / 10 +
  0.2 * (0.77*google + 0.5*android + ...) / 5
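Combining the similarity-weighted per-query models into the final ranker can be sketched as follows (the model weights and similarities below are the hypothetical values from the example):

```python
def combined_score(query_models, query_sims, doc_features):
    """Final rank score: similarity-weighted sum of each similar
    query's linear-model score for the document."""
    score = 0.0
    for query, sim in query_sims.items():
        model = query_models[query]
        score += sim * sum(model.get(w, 0.0) * v
                           for w, v in doc_features.items())
    return score

# Normalized per-query models and query similarities from the example.
query_models = {"samsung phone": {"samsung": 0.5 / 10, "galaxy": 0.2 / 10},
                "android apps": {"google": 0.77 / 5, "android": 0.5 / 5}}
query_sims = {"samsung phone": 0.2, "android apps": 0.2}

doc = {"samsung": 0.6, "android": 0.4}  # a hypothetical new document
final = combined_score(query_models, query_sims, doc)  # ≈ 0.014
```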

Usually in such information-retrieval applications you will have built an inverted index from features (words) to documents, so the final ranker can be evaluated very efficiently over the top documents.

Reference

For details, please refer to the IND algorithm in Omid Madani et al. Learning When Concepts Abound.

Upvotes: 2
