Eric

Reputation: 452

Solr Queries: Single Terms versus Phrases

Our Solr-based search started out using phrase queries. For example, when the user types

blue dress

then the Solr query will be

title:"blue dress" OR description:"blue dress"

We now want to remove stop words. Using the default StopFilterFactory, the query

the blue dress

will match documents containing "blue dress" or "the blue dress".

However, when typing

blue the dress

then it does not match documents containing "blue dress".

I am starting to wonder if we shouldn't instead only search using single terms. That is, convert the above user search into

title:the OR title:blue OR title:dress OR description:the OR description:blue OR description:dress

I am a bit reluctant to do this, though, as it seems to duplicate the work of the StandardTokenizerFactory.
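For illustration only, the client-side splitting described above could look like the following sketch. The class and method names are hypothetical. It splits on whitespace and escapes nothing, which is not how StandardTokenizerFactory works, so it mainly shows what would have to be reimplemented outside Solr.

```java
import java.util.ArrayList;
import java.util.List;

class SingleTermQuery {
    // Naive sketch: one field:term clause per whitespace-separated word.
    // No escaping of Solr query syntax characters is attempted here.
    static String build(String userInput, List<String> fields) {
        List<String> clauses = new ArrayList<>();
        for (String field : fields) {
            for (String term : userInput.trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    clauses.add(field + ":" + term);
                }
            }
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build("the blue dress", List.of("title", "description")));
    }
}
```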

Here is my schema.xml:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>

The title and the description fields are both of type text_general.

Is single-term search the standard way of searching in Solr? Are we exposing ourselves to problems (performance issues, maybe) by tokenising the words before calling Solr? Or is thinking in terms of single terms vs. phrases simply wrong, and should we leave it to the user to decide?

Upvotes: 1

Views: 1896

Answers (2)

Eric

Reputation: 452

Although the initial approach might work if the query were split into multiple title:term clauses, this is error-prone (the tokens might be split in the wrong places) and duplicates, probably badly, the work done by the built-in tokenizer.

The right approach is to keep the initial query as-is and rely on the Solr configuration to handle it properly. This makes sense, but the difficulty was that I wanted to specify the fields to search in. It turns out that the default query parser, the one known as LuceneQParserPlugin, has no parameter for listing several fields to search: fields can only be named inside the query itself, as in title:blue, or through a single default field. (Confusingly, there is a parameter called fl, for Field List, but it specifies the returned fields, not the fields to search in.)

For completeness, it is possible to simulate a list of fields to search in by using the copyField configuration in schema.xml. I find this neither very elegant nor flexible enough.

The elegant solution is to use the ExtendedDisMax query parser, aka edismax. With it, we can maintain the query as is, and fully leverage the configuration in the schema. In our case, it looks like this:

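        import org.apache.solr.client.solrj.SolrQuery;

        SolrQuery solrQuery = new SolrQuery();
        solrQuery.set("defType", "edismax");       // use the ExtendedDisMax query parser
        solrQuery.set("q", query);                 // the raw user input, e.g. "blue the dress"
        solrQuery.set("qf", "description title");  // the fields to search in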

According to this page:

(e)Dismax generally makes the best first choice query parser for user facing Solr applications

It would have helped if this had indeed been the default choice.

Upvotes: 0

cheffe

Reputation: 9500

What you stumble over is the fact that the stop word filter prevents stop words from being indexed, but their positions are recorded nevertheless. Something like a placeholder is kept in the index where the stop word occurred.

So when you put this to your index

the blue dress

it will be indexed as

* blue dress

The same happens when you hand in the phrase

"blue the dress"

as a query. It will be treated as

"blue * dress"

Now Solr compares these two fragments, and they do not match because the * is in the wrong position.
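The effect can be reproduced with a small toy model in plain Java. This is not Solr or Lucene code: the class name, the three-word stop list and the matching loop are all assumptions made for illustration. Stop words become gaps (null) that keep their position, and a phrase matches only if its remaining terms line up with the document at the same relative positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class StopwordPositions {
    // Hypothetical stop word list; a real stopwords.txt is usually longer.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an");

    // Toy analyzer: lowercase, split on whitespace, and turn each stop word
    // into null so the surviving tokens keep their original positions.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            tokens.add(STOP_WORDS.contains(word) ? null : word);
        }
        return tokens;
    }

    // Toy phrase match: try every alignment of the phrase against the document.
    // Gaps in the phrase match anything; real terms must sit at the same offset.
    static boolean phraseMatches(String document, String phrase) {
        List<String> doc = analyze(document);
        List<String> query = analyze(phrase);
        for (int shift = -query.size(); shift < doc.size(); shift++) {
            boolean ok = true;
            boolean sawTerm = false;
            for (int i = 0; i < query.size() && ok; i++) {
                String term = query.get(i);
                if (term == null) {
                    continue; // removed stop word: only its position counts
                }
                sawTerm = true;
                int pos = shift + i;
                ok = pos >= 0 && pos < doc.size() && term.equals(doc.get(pos));
            }
            if (ok && sawTerm) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(phraseMatches("blue dress", "the blue dress")); // true
        System.out.println(phraseMatches("blue dress", "blue the dress")); // false
    }
}
```

In this model "the blue dress" still finds "blue dress" because a leading gap does not change the distance between blue and dress, while "blue the dress" requires dress two positions after blue.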

Prior to Solr 4.4, this used to be tackled by setting enablePositionIncrements="true" in the StopFilterFactory, as described by Pascal Dimassimo. Apparently a refactoring broke that option on the StopFilterFactory, as discussed on SO and in Solr's Jira.


Update: While reading through the reference documentation of the Extended Dis Max Query Parser, I found this:

The stopwords Parameter

A Boolean parameter indicating if the StopFilterFactory configured in the query analyzer should be respected when parsing the query: if it is false, then the StopFilterFactory in the query analyzer is ignored.

I will check if this helps with the problem.
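As a sketch of how that could be tried, the request parameters might be assembled like this. The class and method names are hypothetical, the field list is taken from the question, and whether stopwords=false actually resolves the mismatch is untested here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class EdismaxParams {
    // Builds the query string of a Solr select request that uses edismax.
    // respectStopwords is passed through as the edismax "stopwords" parameter.
    static String build(String userQuery, String fields, boolean respectStopwords) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        params.put("q", userQuery);
        params.put("qf", fields);
        params.put("stopwords", Boolean.toString(respectStopwords));

        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joined.add(entry.getKey() + "="
                    + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // The result would be appended to the select handler URL of your core.
        System.out.println(build("blue the dress", "description title", false));
    }
}
```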

Upvotes: 1
