Reputation: 65
I am using Solr through PHP for searching all aspects of my site. I am trying to implement a feature and can't find any information on how to accomplish it.
I have a group of documents (reviews), each about a specific product.
I want to find unique 1-2 word keywords (no stop words) that appear in multiple reviews for a single product, with a count for how many reviews they appear in.
Once I have that, I want to show the top X keywords, the number of reviews each appears in, and a single top review for each one with the keyword highlighted.
EDIT:
Once I have a list of unique (non stop word/common words) keywords that appear in multiple reviews, I want to rank them by the number of times they appear across reviews. For example, if people are writing reviews about cameras, the keywords might appear like this:
expensive (appears in 7 reviews)
shutter speed (appears in 5 reviews)
poor image (appears in 3 reviews)
Once I have those keywords ranked by number of reviews, I want to select 1 review per keyword and show those reviews highlighting the keyword. For example:
"... unfortunately this camera is far too EXPENSIVE for what you get ..." (in 7 reviews)
"... the SHUTTER SPEED is far too slow for ..." (in 5 reviews)
"... the POOR IMAGE quality is this camera's biggest downfall ..." (in 3 reviews)
As for when to run this, I'm still not sure. Options include real time (computed when a product is viewed, then cached for X time), marking the product for an update whenever a new review is posted, or a daily cron job. It will not be run against all keywords at once; it will be run against all keywords in all reviews for a single product, then repeated for each product.
Hope that makes more sense.
Any help on how to accomplish this in Solr would be greatly appreciated.
Upvotes: 3
Views: 1540
Reputation: 2706
It sounds to me like what you're looking for is the ShingleFilter. You can use it to produce unigrams/bigrams (probably with a copyField) and then get stats on those tokens to generate your interface.
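To make "unigrams/bigrams" concrete, here is a minimal Python sketch of what shingling a token stream produces. It only illustrates the idea and is not Solr's implementation; the function name `shingles` and the `max_size` parameter are made up for this example, and Solr's own filter may order or configure its output differently.

```python
def shingles(tokens, max_size=2):
    """Return every run of 1..max_size adjacent tokens, joined by a space.

    Illustrative only: a stand-in for what a shingle filter emits,
    not Solr's actual code.
    """
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    # Unigrams and bigrams for a short, already-tokenized phrase.
    print(shingles("the shutter speed is slow".split()))
```

Counting how many reviews each of these tokens appears in is then the "stats" step that feeds the keyword list.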
Upvotes: 1
Reputation: 193
This looks like a job for a text parser rather than Solr. You will need a script, probably in Python (since it has good text-parsing libraries), that looks at all the words in the reviews and gives you the top occurring words, either within each review or across all reviews, with their counts. Then you can take a few words on either side of these top occurring words, create an abstract for your document (the product in this case), and index it in Solr to be returned as part of the search result.
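A rough sketch of such a script, using only the standard library, might look like the following. The stop word list, the function names, and the `min_reviews` threshold are all assumptions made for illustration; a real version would want a proper stop word list and probably stemming.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a real one).
STOP_WORDS = {"the", "is", "a", "an", "and", "for", "this", "too", "far", "of"}


def keywords(text):
    """Unique 1-2 word keywords in one review, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    found = {w for w in words if w not in STOP_WORDS}
    # Keep a bigram only when both adjacent words are non-stop words.
    for first, second in zip(words, words[1:]):
        if first not in STOP_WORDS and second not in STOP_WORDS:
            found.add(f"{first} {second}")
    return found


def top_keywords(reviews, limit=10, min_reviews=2):
    """Rank keywords by the number of reviews they appear in."""
    counts = Counter()
    for review in reviews:
        # keywords() returns a set, so each review counts once per keyword.
        counts.update(keywords(review))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(k, n) for k, n in ranked if n >= min_reviews][:limit]


def highlight(review, keyword):
    """Upper-case every occurrence of the keyword in the review text."""
    return re.sub(re.escape(keyword), lambda m: m.group(0).upper(),
                  review, flags=re.IGNORECASE)
```

Run once per product over that product's reviews, `top_keywords` gives the ranked list and `highlight` produces the snippet text for whichever review is chosen for each keyword.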
Upvotes: 0
Reputation: 3513
This task is not particularly well suited to Solr. The only thing you gain from using Solr is the stemming/stop word support, which would be much faster if implemented in a local algorithm. I would create a new table in the database, "review_keyword", mapping reviews to single keywords and keyword pairs. When inserting a new review, also add a separate mapping row for each keyword in the review (this is where stemming and stop words kick in). When you want to look up reviews for a product, you can run a join select across this table to get the top keywords in that product's reviews, plus a review from that set. Depending on your usage, this would be better run on updates rather than on queries.
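As a sketch of that table and join, here is a small example using Python's built-in `sqlite3` with an in-memory database. The `review` table, the column names, and the helper functions are assumptions for illustration; only the "review_keyword" name comes from the answer. Picking the lowest review id as the sample review is a placeholder, since nothing above defines what makes a review the "top" one.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE review (
            id INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL,
            body TEXT NOT NULL
        );
        CREATE TABLE review_keyword (
            review_id INTEGER NOT NULL REFERENCES review(id),
            keyword TEXT NOT NULL,
            PRIMARY KEY (review_id, keyword)
        );
    """)
    return conn


def add_review(conn, product_id, body, keywords):
    """Insert a review plus one mapping row per distinct keyword."""
    cur = conn.execute(
        "INSERT INTO review (product_id, body) VALUES (?, ?)",
        (product_id, body),
    )
    review_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO review_keyword (review_id, keyword) VALUES (?, ?)",
        [(review_id, k) for k in set(keywords)],
    )
    return review_id


def top_keywords(conn, product_id, limit=10):
    """Keywords found in more than one review of a product, most common first.

    Returns (keyword, review_count, sample_review_id) tuples.
    """
    return conn.execute(
        """
        SELECT rk.keyword, COUNT(*) AS n, MIN(r.id) AS sample_review_id
        FROM review_keyword AS rk
        JOIN review AS r ON r.id = rk.review_id
        WHERE r.product_id = ?
        GROUP BY rk.keyword
        HAVING COUNT(*) > 1
        ORDER BY n DESC, rk.keyword
        LIMIT ?
        """,
        (product_id, limit),
    ).fetchall()
```

The keyword extraction itself (stemming, stop words, pairs) happens before `add_review` is called, which is what makes this approach cheap at query time.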
Upvotes: 0