Simon

Reputation: 352

Solr - Analysis works but zero query results

I'm trying to add Chinese support to my index using SmartChineseAnalyzer. The analyzer works as expected on the Analysis page, but querying does not return the item that contains the same text.

In the Analysis page, I'm using the following Chinese text as Field Value (Index) and 滴灌 as Field Value (Query).

The tokenization and marking of the searched term seem to work as expected (query value marked in bold):

netafim | 是 | 用于 | 实现 | 可 | 持续 | 未来 | 的 | 滴灌 | 和 | 微 | 灌溉 | 解决 | | 案 | 的 | 全球 | 领导者 |   | 在 | 水 |   | 粮食 | 安全 | 和 | 耕地 | 的 | 交汇 | 处 |   | 滴灌 | 可 | 使 | 种植 | 者 | 以 | 最低 | 的 | 环境 | 影响 | 实现 | 粮食 | 生产 | 的 | 最大化

However, simply querying for 滴灌 in the query page does not return any results.

Note that this text does appear in one of the item's attributes (description_zh), and that I am able to find the item by querying on the parallel English attribute (description_en).

My configuration:

Solr version - 6.4.2

schema.xml

...
 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_general_zh" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>

...

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="description_en" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="description_zh" type="text_general_zh" indexed="true" stored="true" multiValued="false"/>
<field name="size" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="name_en" type="text_en_splitting" indexed="true" stored="true" multiValued="false"/>
<field name="name_zh" type="text_en_splitting" indexed="true" stored="true" multiValued="false"/>

...

<uniqueKey>id</uniqueKey>

<!-- Copy Fields -->
<copyField source="name_en" dest="text"/>
<copyField source="name_zh" dest="text"/>
<copyField source="description_en" dest="text"/>
<copyField source="description_zh" dest="text"/>

...

solrconfig.xml

...

  <luceneMatchVersion>6.4.2</luceneMatchVersion>

  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" />
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" />

  <lib dir="../lib/" regex="mysql-connector-java-\d.*\.jar" />
  <lib dir="../lib/" regex="lucene-analyzers-smartcn-\d.*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

  <directoryFactory name="DirectoryFactory" 
                    class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
  </directoryFactory>

  <codecFactory class="solr.SchemaCodecFactory"/>

  <schemaFactory class="ClassicIndexSchemaFactory"/>


  <dataDir>${solr.blacklight-core.data.dir:}</dataDir>

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
  </requestDispatcher>

  <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">  
      <str name="config">data-config.xml</str>
    </lst> 
  </requestHandler>

  <!-- config for the admin interface --> 
  <admin>
    <defaultQuery>*:*</defaultQuery>
  </admin>

  <requestHandler name="search" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters can be specified, these
         will be overridden by parameters in the request
      -->
     <lst name="defaults">
       <str name="defType">dismax</str>
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>

       <str name="q.alt">*:*</str>

       <str name="q.op">OR</str>
       <str name="df">text</str>

       <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

       <str name="qf">
         name^100000
         description^25000
         text
       </str>
       <str name="pf">
         name^1000000
         description^250000
         text^10
       </str>

       <str name="fl">
          id, 
          name_en,
          name_zh,
          size,
          description_en,
          description_zh,
          created_at,
          updated_at
       </str>

       <str name="facet">true</str>
       <str name="facet.mincount">1</str>
       <str name="facet.limit">10</str>
       <str name="facet.field">company_size</str>

     </lst>

    ...

What am I missing?

Thanks! Simon.

Upvotes: 0

Views: 459

Answers (1)

Pavel Vasilev

Reputation: 1042

(1) Let's look closely at your solrconfig.xml:

   <str name="qf">
     name^100000
     description^25000
     text
   </str>

This means the dismax query parser searches name, description and text as its query fields (the field boosts differ, but that is not important here).

(2) The text field accumulates field-values from several sources: name_en, name_zh, description_en, description_zh. This is what your schema.xml is doing:

<!-- Copy Fields -->
<copyField source="name_en" dest="text"/>
<copyField source="name_zh" dest="text"/>
<copyField source="description_en" dest="text"/>
<copyField source="description_zh" dest="text"/>

(3) On the other hand, the text field has no analysis for Chinese (I bet it has fieldType=text_general; please correct me if I'm wrong). So when you query against the text field, you never get Chinese-specific text analysis.
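
A minimal sketch of the assumption in (3): the question does not show how the text field is declared, so the definition below is a guess, not the asker's actual schema. If text is declared with text_general, every value copied into it, including description_zh, is tokenized by StandardTokenizerFactory rather than by SmartChineseAnalyzer:

```xml
<!-- Assumed declaration: the schema.xml excerpt in the question does not include the "text" field. -->
<!-- multiValued="true" is also an assumption; a copyField destination fed by several sources normally needs it. -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
```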

(4) To solve the problem, separate the query-time field set: instead of querying the text field (which accumulates all fields), list the per-language fields individually, like this:

   <str name="qf">
     name^100000
     description^25000
     name_en
     name_zh
     description_en
     description_zh
   </str>

Then each field is analyzed with its own fieldType:

  • field=description_en against fieldType=text_general
  • field=description_zh against fieldType=text_general_zh
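
To check the per-field analysis in isolation, one option is a separate handler that searches only the Chinese description field. This is a sketch under assumptions: the handler name /search_zh is hypothetical, and only description_zh is listed because the schema in the question declares name_zh as text_en_splitting, which is a different field type from text_general_zh:

```xml
<!-- Hypothetical handler for testing; the name "/search_zh" is not part of the original config. -->
<requestHandler name="/search_zh" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <!-- Query only the field typed as text_general_zh, so the query text goes through SmartChineseAnalyzer. -->
    <str name="qf">description_zh</str>
    <str name="fl">id, description_zh</str>
  </lst>
</requestHandler>
```

If a query for 滴灌 through such a handler returns the item, the index side is probably fine and the remaining work is in the qf list of the main handler. Adding debugQuery=true to the request shows how the query was parsed for each field.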

Upvotes: 2
