Solr index-time tokenizer / filter not working as expected?

Question

I'm working with a Solr instance that was set up earlier at my company, and it doesn't seem to be configured correctly. I get results when I search for something like q=*Paper*, but not for q=paper.

It seems like maybe the index-time tokenizer / filter isn't working as I'd expected.

The schema.xml is set up to tokenize, and then to index and query case-insensitively, on this description field, for example:


...etc...

And the solrconfig.xml has the default qf set to:


  
      false
      default
      wordbreak
      true
      false
      true
      3
      1
      10
      synonym_edismax
      false
      C_PN^20.0 PN^15.0 C_S_DSC^10.0 S_DSC^10.0 M_PN^5.0 DIM_NM^2.0 BRD^2.0 combined_search^1
  {!type=synonym_edismax qf=$qf v=$q}

When I query for q=* I get results (select?q=*&rows=10&start=0&wt=json):

    "docs": [
        {
            "S_DSC": "Foo 8.5\" x 11\" Copy Paper, 20 lbs, 92 Brightness, 5000/Carton (123456)"
            ...etc...
        },

But if I try to search on a term in the description (S_DSC), I don't get results unless it's case sensitive AND I put asterisks around it.

I get results for q=*Paper*

"parsedquery": "(+DisjunctionMaxQuery((combined_search:*paper* | PN:*Paper*^15.0 | S_DSC:*paper*^10.0 | C_PN:*Paper*^20.0 | BRD:*Paper*^2.0 | M_PN:*Paper*^5.0 | DIM_NM:*Paper*^2.0 | C_S_DSC:*paper*^10.0)))/no_coord",

No results for q=paper

"parsedquery": "(+DisjunctionMaxQuery((combined_search:paper | PN:paper^15.0 | S_DSC:paper^10.0 | C_PN:paper^20.0 | BRD:paper^2.0 | M_PN:paper^5.0 | DIM_NM:paper^2.0 | C_S_DSC:paper^10.0)))/no_coord",

No results for q=Paper

"parsedquery": "(+DisjunctionMaxQuery((combined_search:paper | PN:Paper^15.0 | S_DSC:paper^10.0 | C_PN:Paper^20.0 | BRD:Paper^2.0 | M_PN:Paper^5.0 | DIM_NM:Paper^2.0 | C_S_DSC:paper^10.0)))/no_coord",

Shouldn't it be tokenizing the S_DSC above then lowercasing the tokens? (So that paper is among them?) What am I missing here? Appreciate any insight :)

MatsLindh · Accepted Answer

Your S_DSC field is not indexed:

 indexed="false"  <--

An unindexed field will never generate a hit. My guess is that your hit is coming from one of the other, unprocessed fields which are indexed, and that's why you're getting the behaviour you're seeing.
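To illustrate why an unprocessed (string-type) field would only match a case-sensitive wildcard query, here is a toy model in Python. It is not how Solr is implemented, and the helper names (analyzed_terms, string_terms, matches) are hypothetical. It assumes the matching field stores the whole description as a single term, which the question does not confirm.

```python
from fnmatch import fnmatchcase

def analyzed_terms(value):
    """Roughly what a tokenized, lowercased text field would index (simplified)."""
    return {token.strip('.,()"').lower() for token in value.split()} - {""}

def string_terms(value):
    """An unanalyzed string field indexes the whole value as one single term."""
    return {value}

def matches(terms, query):
    """Case-sensitive term match; '*' in the query acts as a wildcard."""
    return any(fnmatchcase(term, query) for term in terms)

description = 'Foo 8.5" x 11" Copy Paper, 20 lbs, 92 Brightness, 5000/Carton (123456)'

unprocessed = string_terms(description)   # assumption: how a string field sees it
analyzed = analyzed_terms(description)    # what an indexed text field would hold

print(matches(unprocessed, "paper"))    # False: the only term is the full string
print(matches(unprocessed, "Paper"))    # False: still not the full string
print(matches(unprocessed, "*Paper*"))  # True: wildcard spans the rest of the term
print(matches(analyzed, "paper"))       # True: what you expected from S_DSC
```

This mirrors the parsed queries in the question: the wildcard terms for fields such as PN and C_PN keep their original case (PN:*Paper*), which suggests those fields have no lowercasing applied.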

When you append debug=all to your query, each document found will show the term frequency matched (i.e. what makes up the score) for each field, allowing you to see which fields generated the hits.
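A minimal sketch of building such a debug request in Python, reusing the parameters from the question. The host and core name (http://localhost:8983/solr/mycore) and the helper name debug_query_url are placeholders, not taken from the original setup.

```python
from urllib.parse import urlencode

def debug_query_url(base_url, q, rows=10):
    """Build a /select URL with debug=all added to the question's parameters."""
    params = {"q": q, "rows": rows, "start": 0, "wt": "json", "debug": "all"}
    # urlencode percent-encodes special characters such as '*' in the query.
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical host and core name; replace with your own.
url = debug_query_url("http://localhost:8983/solr/mycore", "*Paper*")
print(url)
```

In the JSON response, the debug section should contain an explain entry per returned document; the field names that appear there tell you which fields actually produced the match.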
