rails9231

Reputation: 39

Thinking sphinx ranking and statistics

I'm trying to get some statistics out of my Sphinx indexes, but I'm not sure how to get the numbers I want.

I have a MySQL database of articles, with a Sphinx index and full-text search set up for it, all working. What I want are these numbers:

  1. How many times a search term (keyword or key phrase) appears across all articles, either over all time or, more likely, limited to articles from a time interval between X and Y.
  2. The same as the previous one, but counting how many times two keywords or key phrases (so "x AND y") appear in the same articles.

I was doing something similar to the first one manually, using a batch file I wrote:

indexer ind_core -c c:\%SOME_PATH%\development.sphinx.conf --buildstops stats.txt 10000 --buildfreqs

This generated a text file listing all the repeated keywords and how often they appear. It was useful in the early stages of development for putting together a list of keywords I'm interested in. Now I'm trying to do the same thing, but only for a finite list of predetermined keywords, and integrated into my Rails project so I can build charts in the future.

I tried running some queries like

@testing = Article.search 'Keyword1 AND Keyword2', :ranker => :wordcount

but I'm not sure how it works, how to process the result, or whether it's even what I'm looking for.

Another approach I tried was running manual SphinxQL queries (through the mysql client), such as

 SELECT id,title,WEIGHT() AS w FROM ind_core WHERE MATCH('@title keyword1 | keyword2') OPTION ranker=expr('sum(hit_count)');

but I'm not sure how to process the results from this either, or how to integrate it into my existing Rails project. It's also limited to 20 rows per query (which I think I can change somewhere in the settings?). Still, looking at those results, what I'm interested in is the hit_count across all articles (or all articles from a set time frame).

Any ideas on how to do this?

UPDATE: The approach I've found for now is to add

@testing = Article.search params[:search], :without => {:is_active => false}, :ranker => :bm25

to the controller, with some conditions so it doesn't break on a nil search. :is_active is my soft-delete flag; I don't want to search deleted entries, so never mind that part. In the view I simply display

<%= @testing.total_entries %>

which, if I understand correctly, shows the number of matches Sphinx found (so pretty much what I was looking for).

Upvotes: 0

Views: 303

Answers (1)

pat

Reputation: 16226

So, to figure out the number of hits per document, you're pretty much on the right track; it's just a matter of getting it into Ruby/Thinking Sphinx.

To get the raw Sphinx results (if you don't need the ActiveRecord objects):

search = Article.search "foo",
  :ranker     => "expr('SUM(hit_count)')",
  :select     => "*, weight()",
  :middleware => ThinkingSphinx::Middlewares::RAW_ONLY

… this will return an array of hashes. You can use the weight() string key for the hit count, and the sphinx_internal_id string key for the model's primary key (id is Sphinx's own primary key, which isn't so useful).
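As a rough sketch of how those raw results might be processed, the snippet below turns the array of hashes into a lookup from model primary key to hit count. The `raw_results` sample data and the `hits_by_article_id` helper are made up for illustration; in a real app the array would come from the `Article.search` call above.

```ruby
# Hypothetical sample of what the RAW_ONLY search returns: an array of
# hashes with string keys. These values are invented for illustration.
raw_results = [
  { "sphinx_internal_id" => 12, "weight()" => 3 },
  { "sphinx_internal_id" => 47, "weight()" => 1 },
  { "sphinx_internal_id" => 50, "weight()" => 5 }
]

# Map each model primary key to its hit count.
# to_i guards against values arriving as strings.
def hits_by_article_id(rows)
  rows.each_with_object({}) do |row, counts|
    counts[row["sphinx_internal_id"].to_i] = row["weight()"].to_i
  end
end

hit_counts = hits_by_article_id(raw_results)
total_hits = hit_counts.values.sum
```

Note that summing this way only covers the rows on the current page of results.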

Or, if you want to use the ActiveRecord objects, Thinking Sphinx has the ability to wrap each search result in a helper object which passes appropriate methods through to the underlying model instances, but lets weight respond with the values from Sphinx:

search = Article.search "foo",
  :ranker     => "expr('SUM(hit_count)')",
  :select     => "*, weight()"; ""
search.context[:panes] << ThinkingSphinx::Panes::WeightPane
search.each do |article|
  puts article.weight
end

Keep in mind that panes must be added before the search is evaluated, so if you're testing this in a Rails console, you'll want to avoid letting the console inspect the search variable (which I usually do by adding ; "" at the end of the initial search call).

In both of these cases, as you've noted, the search results are paginated: you can use the :page option to determine which page of results you want, and :per_page to determine the number of records returned in each request. There is a default limit of 1000 results overall, but that can be altered using the max_matches setting.
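If you do need every row rather than one page, one possible approach is to loop over pages until a short or empty page comes back. This is only a sketch: `each_page_rows` is a hypothetical helper, and `fake_index` stands in for Sphinx so the example runs on its own. In the Rails app the block body would be an `Article.search` call with `:page => page, :per_page => per_page` added to the options.

```ruby
# Collects rows from every page by yielding (page, per_page) to the block
# until a page comes back empty or shorter than per_page.
# max_pages is a safety cap so the loop cannot run forever.
def each_page_rows(per_page:, max_pages: 100)
  rows = []
  (1..max_pages).each do |page|
    page_rows = yield(page, per_page).to_a
    break if page_rows.empty?
    rows.concat(page_rows)
    break if page_rows.size < per_page
  end
  rows
end

# Stand-in for Sphinx: invented data, sliced into pages for illustration.
fake_index = (1..45).map { |id| { "sphinx_internal_id" => id, "weight()" => 1 } }

all_rows = each_page_rows(per_page: 20) do |page, per_page|
  fake_index.slice((page - 1) * per_page, per_page) || []
end
```

The overall max_matches limit still applies, so paging alone won't get past it.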

Now, if you want the number of times the keywords appear across all Sphinx records, then the best way to do that, while also taking advantage of Thinking Sphinx's search options, is to get the raw results of an aggregate SUM, similar to the first option above.

search = Article.search "foo",
  :ranker     => "expr('SUM(hit_count)')",
  :select     => "SUM(weight()) AS count",
  :middleware => ThinkingSphinx::Middlewares::RAW_ONLY
search.first["count"]
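For the "articles from X to Y" part of the question, the same aggregate could probably be combined with a :with range filter. The sketch below only builds the options hash so it can run outside Rails. It assumes the index defines a timestamp attribute called created_at, which is not shown in the question, and the `hit_count_options` helper and the dates are hypothetical.

```ruby
# Builds search options for an aggregate hit count limited to a time window.
# ASSUMPTION: :created_at is a timestamp attribute in the Sphinx index.
def hit_count_options(from, to)
  {
    :ranker => "expr('SUM(hit_count)')",
    :select => "SUM(weight()) AS count",
    :with   => { :created_at => from..to }
  }
end

# Example window; the dates are placeholders.
options = hit_count_options(Time.utc(2015, 1, 1), Time.utc(2015, 2, 1))

# In the Rails app (not run here):
#   search = Article.search "foo",
#     options.merge(:middleware => ThinkingSphinx::Middlewares::RAW_ONLY)
#   search.first["count"]
```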

Upvotes: 1
