Reputation: 16362
I'm converting a project from Solr to CloudSearch, and I've hit an issue for which I can't find a workaround, even after a decently long search of the docs and the web. I'm hoping someone else can help.
I'm unable to describe the true details, but the closest analogy I can find for my problem is plagiarism detection. Imagine having loaded a bunch of published documents into CloudSearch, and then taking an amateur document as the query to see if there's a match.
Given an indexed document - say Wikipedia's Tyrannosaurus page:
Like other tyrannosaurids, Tyrannosaurus was a bipedal carnivore with a massive skull balanced by a long, heavy tail.
Then along comes the amateur document:
I'm a carnivore, and I like the Tyrannosaurus because he was a bipedal carnivore, too.
For reasons that are important to the project, I'm creating a distribution of the interesting words rather than querying with the full text, e.g.:
carnivore: 2
tyrannosaurus: 1
I'd like to give more weight to finding the word "carnivore" in the Wikipedia article than to finding "tyrannosaurus".
In Solr, I'm boosting the query using the "^" operator, e.g. "carnivore^2".
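As a rough sketch of how the distribution turns into a boosted Solr query (the helper name `build_boosted_query` is made up for illustration and is not the project's actual code; it also assumes terms are plain single words that need no escaping):

```python
def build_boosted_query(term_counts):
    """Turn a {term: count} distribution into a Solr query string.

    Each term is boosted by its count using Solr's "^" operator.
    Terms with a count of 1 are left unboosted, since ^1 is the default.
    """
    parts = []
    # Highest counts first, then alphabetical, so the output is stable.
    for term, count in sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parts.append(f"{term}^{count}" if count > 1 else term)
    return " ".join(parts)


print(build_boosted_query({"carnivore": 2, "tyrannosaurus": 1}))
# prints: carnivore^2 tyrannosaurus
```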
From what I can find, CloudSearch does boosting through "rank expressions", but I haven't found anything that covers my case.
Any ideas?
Upvotes: 2
Views: 616
Reputation: 2585
Look into Zipf's law (there is also a similar Zipf-Mandelbrot law, but it is harder to implement). Basically, it states that in any language (and specifically within any particular domain), word frequencies follow a Zipf distribution. You can build a word-frequency list, order it by frequency so it can be fitted to a Zipf distribution, tune the parameters of the distribution from it, and then extrapolate the term relevance.
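To make the fitting step a bit more concrete, here is a minimal sketch in plain Python (nothing CloudSearch-specific). It assumes the simple form frequency ≈ C / rank^s and estimates the exponent s with a least-squares line through the log-log points. The function name `fit_zipf_exponent` is made up for illustration, and how you map the fitted curve to a term relevance is left open.

```python
import math


def fit_zipf_exponent(term_counts):
    """Estimate s in freq ~ C / rank**s from a {term: count} mapping.

    Fits log(freq) = log(C) - s * log(rank) by ordinary least squares
    and returns s (the negated slope of the fitted line).
    """
    # Rank 1 is the most frequent word, rank 2 the next, and so on.
    freqs = sorted(term_counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two terms to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)

    return -covariance / variance


# Toy counts that follow 120 / rank exactly, so the fitted exponent is about 1.
counts = {"the": 120, "of": 60, "and": 40, "to": 30}
print(round(fit_zipf_exponent(counts), 6))
```

On real text the points will not sit exactly on a line, so treat the fitted exponent as a rough description of the corpus rather than an exact value.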
Based on your question, I understand you are implementing some kind of tf-idf; this is more advanced than that. Unfortunately, I think your question is really a computer science / linguistics question, and it requires more explanation than I can write in this post.
I don't use CloudSearch (I work on a Natural Language Processing project too, but without CloudSearch), but checking around I found this: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/rankexpressions.html
You can build the Zipf distribution (or any customization/flavour of it) with those operands and define a threshold for your ranking.
This is not a "clean code" answer but I hope it will help you.
Upvotes: 1