Reputation: 4003

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.

When a query is made, TF should be calculated using all the documents that appear in the result list.

I assume that the only problem with this approach would be when no query is made, i.e. when one searches for *:*. In that case, no term will prevail over the others in terms of interestingness. Please correct me if I am wrong here.

Anyway, is this possible? What other relative measurements of "interesting-ness" would you suggest?

Upvotes: 3

Answers (3)

ewomant

Reputation: 11

There has been a discussion about this way back in 2009.

Currently, with the greater flexibility of the JSON Facet API (json.facet), e.g. sorting on stats facets (such as avg(price)) of another field, I guess this could be implemented as an additional sort option. At least for facets of type terms, the result count (the df of the term within the current result set) only needs to be divided by the df of that term for the whole index (docfreq). If the current result set is the complete index, facets should be sorted by count.

I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cached query on the complete index. However, for term fields and similar this might not scale.
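The ratio described above can be sketched on the client side. This is a minimal sketch, not something Solr provides: the function name rank_facets_by_ratio and the two dictionaries are hypothetical, and it assumes you already have the facet counts for the current result set and for the complete index as term-to-count mappings.

```python
def rank_facets_by_ratio(result_counts, index_counts):
    """Rank facet terms by (count in result set) / (count in whole index).

    Both arguments are assumed to be dicts mapping a facet term to its
    document count. Terms missing from index_counts are skipped, since
    no ratio can be computed for them.
    """
    scored = []
    for term, count in result_counts.items():
        index_df = index_counts.get(term, 0)
        if index_df <= 0:
            continue  # no index-wide count available for this term
        scored.append((term, count / index_df))
    # Highest ratio first; ties are broken by raw count, then by term.
    scored.sort(key=lambda pair: (-pair[1], -result_counts[pair[0]], pair[0]))
    return scored
```

With made-up counts such as {"the": 90, "solr": 40} for the result set and {"the": 1000, "solr": 50} for the index, a common word like "the" ends up below "solr" even though its raw count is higher.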

Upvotes: 0

ewomant

Reputation: 11

This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?

I assume that for facets with a limited number of possible values, an interestingness score can be computed on the client side: for a given result set based on a filter, we can exclude this filter for the facet using the local params syntax ({!tag} and {!ex}). On the client side, we can then compute the relative counts compared to the complete index (or to the result of another subset of the filters). This would probably not work for result sets built by a query parameter.
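A request along these lines could fetch both sets of counts at once. This is only a sketch under assumptions: the helper name build_facet_params, the tag name flt, and the example field names are made up for illustration; the {!tag=...}, {!ex=...} and key=... local params are the standard Solr tagging and excluding syntax.

```python
from urllib.parse import urlencode


def build_facet_params(query, filter_query, field, tag="flt"):
    """Build Solr parameters that facet on `field` twice: once within the
    filtered result set, and once with the tagged filter excluded."""
    return [
        ("q", query),
        # Tag the filter so that a facet can exclude it again.
        ("fq", f"{{!tag={tag}}}{filter_query}"),
        ("rows", "0"),
        ("facet", "true"),
        # Counts restricted to the filtered result set.
        ("facet.field", field),
        # Counts with the tagged filter excluded, returned under another key.
        ("facet.field", f"{{!ex={tag} key={field}_unfiltered}}{field}"),
    ]


# Hypothetical usage: the query string can be appended to a /select URL.
query_string = urlencode(build_facet_params("*:*", "category:books", "keywords"))
```

The two facet lists in the response can then be divided term by term on the client. As noted above, this only covers restrictions expressed as fq filters, not those coming from q.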

However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.

For these cases, it would probably be better to implement this within Solr as a new option for facet.sort, because the information needed is readily available at the time facet counts are computed.

Upvotes: 1

Mysterion

Reputation: 9320

facet.sort

This param determines the ordering of the facet field constraints.

count - sort the constraints by count (highest count first)

index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.

The default is count if facet.limit is greater than 0, index otherwise.

Prior to Solr 1.4, one needed to use true instead of count and false instead of index.

This parameter can be specified on a per field basis.
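As a small illustration of the two built-in orderings and the per-field form, the parameters could be assembled like this. The helper name facet_sort_params and the field names are hypothetical; the f.<fieldname>.facet.sort form is the usual Solr per-field override syntax.

```python
from urllib.parse import urlencode


def facet_sort_params(fields_to_sort, default_sort="count"):
    """Build facet parameters with a global facet.sort and per-field overrides.

    `fields_to_sort` maps a field name to "count" or "index".
    """
    params = [("facet", "true"), ("facet.sort", default_sort)]
    for field, sort in fields_to_sort.items():
        if sort not in ("count", "index"):
            raise ValueError(f"unsupported facet.sort value: {sort}")
        params.append(("facet.field", field))
        # Per-field override of the global facet.sort setting.
        params.append((f"f.{field}.facet.sort", sort))
    return params


# Hypothetical field names, for illustration only.
query_string = urlencode(facet_sort_params({"author": "index", "keywords": "count"}))
```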

It looks like you can't do this out of the box without some serious changes on the client side or in Solr.

Upvotes: 1
