Alain Collins

Reputation: 16362

Boosting cloudsearch results based on frequency of terms in input

I'm converting a project from Solr to CloudSearch, and I've hit an issue that I can't find a workaround for, even after a fairly long search of the docs and the web. I'm hoping someone else can help.

I can't describe the real details, but the closest example I can find to my problem is plagiarism detection. Imagine having loaded a bunch of published documents into CloudSearch, and then taking an amateur document as the query to see if there's a match.

Given an indexed document - say Wikipedia's Tyrannosaurus page:

Like other tyrannosaurids, Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail.

Then along comes the amateur document:

I'm a carnivore, and I like the Tyrannosaurus because he was a bipedal carnivore, too.

For reasons that are important to the project, I'm creating a distribution of the interesting words rather than querying with the full text, e.g.:

carnivore: 2
tyrannosaurus: 1

And I'd like to give more weight to finding the word "carnivore" in the Wikipedia article than to finding "tyrannosaurus".

In Solr, I'm boosting the query using the "^" operator, e.g. "carnivore^2".
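To make the Solr side concrete, here is a minimal sketch of turning the word distribution above into a boosted query string. The function name `boosted_solr_query` is hypothetical, and using the raw count directly as the boost factor is an assumption for illustration; the actual project may scale the counts differently.

```python
from collections import Counter


def boosted_solr_query(term_counts):
    """Join terms into a query string, using each count as its "^" boost."""
    # Most frequent terms first; ties broken alphabetically for stable output.
    ordered = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(f"{term}^{count}" for term, count in ordered)


# The distribution from the example above.
counts = Counter({"carnivore": 2, "tyrannosaurus": 1})
print(boosted_solr_query(counts))  # prints: carnivore^2 tyrannosaurus^1
```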

From what I can find, CloudSearch handles boosting through "rank expressions", but I haven't found anything there that matches my use case.

Any ideas?

Upvotes: 2

Views: 616

Answers (1)

Ezequiel Gorbatik

Reputation: 2585

Look into Zipf's law (there is also a similar one, the Zipf-Mandelbrot law, but it is harder to implement). Basically, it states that for any language (and specifically within any particular domain), word frequencies follow a Zipf distribution. You can build a word-frequency list, order it by rank, fit the parameters of the Zipf distribution to it, and extrapolate term relevance from the fit.
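As a rough illustration of the fitting step: Zipf's law says the frequency of the word at rank r is approximately C / r^s, so log(frequency) is roughly linear in log(rank). The sketch below estimates s and log C with an ordinary least-squares line on the log-log data. This is one simple way to tune the parameters, not necessarily the best one, and the name `fit_zipf` is hypothetical.

```python
import math


def fit_zipf(word_counts):
    """Estimate (s, log_c) in log f(r) = log_c - s * log r by least squares."""
    # Rank 1 is the most frequent word, rank 2 the next, and so on.
    freqs = sorted(word_counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a line")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The Zipf exponent is the negated slope of the log-log line.
    return -slope, intercept


# Made-up counts that follow f(r) = 1200 / r exactly, so s should come out near 1.
s, log_c = fit_zipf({"the": 1200, "of": 600, "and": 400, "to": 300})
print(round(s, 6), round(math.exp(log_c), 3))
```

With the fitted curve you can compare a word's observed frequency against the frequency its rank predicts, which is one way to derive a relevance weight.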

Based on your question, I understand you are implementing some kind of tf-idf; this is more advanced than that. Unfortunately, I think your question is more of a computer science / linguistics question, and it requires more explanation than I can write in this post.

I don't use CloudSearch (I work on a Natural Language Processing project too), but checking around I found this: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/rankexpressions.html

You can build the Zipf distribution (or any customization/flavour of it) with those operands and define a threshold for your ranking.
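A different route from rank expressions, and closer to Solr's "^": as far as I understand, CloudSearch's structured query parser (q.parser=structured) accepts a boost option on term expressions. The sketch below builds such a query from the word counts. Treat the exact syntax as an assumption to verify against the CloudSearch documentation; the function name `boosted_cloudsearch_query` and the quoting rule are hypothetical.

```python
def boosted_cloudsearch_query(term_counts):
    """Build an (or ...) structured query with one boosted term per word."""

    def quote(term):
        # Assumed escaping: backslash-escape backslashes and single quotes.
        return "'" + term.replace("\\", "\\\\").replace("'", "\\'") + "'"

    # Most frequent terms first; ties broken alphabetically for stable output.
    ordered = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    clauses = [f"(term boost={count} {quote(term)})" for term, count in ordered]
    return "(or " + " ".join(clauses) + ")"


print(boosted_cloudsearch_query({"carnivore": 2, "tyrannosaurus": 1}))
# prints: (or (term boost=2 'carnivore') (term boost=1 'tyrannosaurus'))
```

The boost values here are the raw counts; they could just as well be weights derived from a fitted Zipf distribution.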

This is not a "clean code" answer but I hope it will help you.

Upvotes: 1
