Reputation: 352
I'm trying to add Chinese support to my index using SmartChineseAnalyzer. Analysis works as expected on the Analysis page, but querying does not return the item that contains the same text.
On the Analysis page, I enter the Chinese text whose tokenization is shown below as Field Value (Index) and 滴灌 as Field Value (Query).
The tokenization and marking of the searched term seem to work as expected (query value marked in bold):
netafim | 是 | 用于 | 实现 | 可 | 持续 | 未来 | 的 | 滴灌 | 和 | 微 | 灌溉 | 解决 | | 案 | 的 | 全球 | 领导者 | | 在 | 水 | | 粮食 | 安全 | 和 | 耕地 | 的 | 交汇 | 处 | | 滴灌 | 可 | 使 | 种植 | 者 | 以 | 最低 | 的 | 环境 | 影响 | 实现 | 粮食 | 生产 | 的 | 最大化
However, simply querying for 滴灌 in the query page does not return any results.
It's important to note that this text does indeed appear in an item's attribute (description_zh) and that I AM able to find the item by querying the parallel English attribute (description_en).
My configuration:
Solr version - 6.4.2
...
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_general_zh" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>
...
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="description_en" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="description_zh" type="text_general_zh" indexed="true" stored="true" multiValued="false"/>
<field name="size" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="name_en" type="text_en_splitting" indexed="true" stored="true" multiValued="false"/>
<field name="name_zh" type="text_en_splitting" indexed="true" stored="true" multiValued="false"/>
...
<uniqueKey>id</uniqueKey>
<!-- Copy Fields -->
<copyField source="name_en" dest="text"/>
<copyField source="name_zh" dest="text"/>
<copyField source="description_en" dest="text"/>
<copyField source="description_zh" dest="text"/>
...
...
<luceneMatchVersion>6.4.2</luceneMatchVersion>
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" />
<lib dir="../lib/" regex="mysql-connector-java-\d.*\.jar" />
<lib dir="../lib/" regex="lucene-analyzers-smartcn-\d.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
</directoryFactory>
<codecFactory class="solr.SchemaCodecFactory"/>
<schemaFactory class="ClassicIndexSchemaFactory"/>
<dataDir>${solr.blacklight-core.data.dir:}</dataDir>
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
</requestDispatcher>
<requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
<!-- config for the admin interface -->
<admin>
<defaultQuery>*:*</defaultQuery>
</admin>
<requestHandler name="search" class="solr.SearchHandler" default="true">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="q.alt">*:*</str>
<str name="q.op">OR</str>
<str name="df">text</str>
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<str name="qf">
name^100000
description^25000
text
</str>
<str name="pf">
name^1000000
description^250000
text^10
</str>
<str name="fl">
id,
name_en,
name_zh,
size,
description_en,
description_zh,
created_at,
updated_at
</str>
<str name="facet">true</str>
<str name="facet.mincount">1</str>
<str name="facet.limit">10</str>
<str name="facet.field">company_size</str>
</lst>
...
What am I missing?
Thanks! Simon.
Upvotes: 0
Views: 459
Reputation: 1042
(1) Let's look closely at your solrconfig.xml:
<str name="qf">
name^100000
description^25000
text
</str>
This means you have name, description and text as query fields in your dismax query parser (with different field boosts, but that's not important here).
(2) The text field accumulates field values from several sources: name_en, name_zh, description_en and description_zh. This is what your schema.xml is doing:
<!-- Copy Fields -->
<copyField source="name_en" dest="text"/>
<copyField source="name_zh" dest="text"/>
<copyField source="description_en" dest="text"/>
<copyField source="description_zh" dest="text"/>
(3) On the other hand, the text field has no analysis for Chinese (I bet it has fieldType=text_general; please correct me if I'm wrong). So by querying against the text field you will never get Chinese-specific text analysis.
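To make point (3) concrete, here is a sketch of what the destination field probably looks like in schema.xml. The question elides this definition, so the type and the stored and multiValued attributes below are assumptions, not the poster's actual configuration.

```xml
<!-- Assumed definition: the real one is not shown in the question. -->
<!-- Every value copied in by copyField, Chinese included, is analyzed with this single type. -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
```

If the field is declared this way, description_zh content copied into text goes through the text_general chain rather than SmartChineseAnalyzer, so the tokens may differ from what the Analysis page shows for description_zh.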
(4) To solve your problem, query the individual fields separately. That is, instead of the text field (which accumulates all fields), list each source field, like this:
<str name="qf">
name^100000
description^25000
name_en
name_zh
description_en
description_zh
</str>
Then the analyzer will be run against the proper fieldType for each field:
field=description_en against fieldType=text_general
field=description_zh against fieldType=text_general_zh
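The qf change in (4) leaves pf untouched, and the pf in the question still points at text. Below is a minimal sketch of aligning the two, assuming the per-language fields from the posted schema are the ones you want to search. The boost values are placeholders chosen for illustration, not recommendations.

```xml
<!-- Sketch only: boosts are hypothetical; field names come from the posted schema. -->
<str name="qf">
  name_en^100000
  name_zh^100000
  description_en^25000
  description_zh^25000
</str>
<str name="pf">
  name_en^1000000
  name_zh^1000000
  description_en^250000
  description_zh^250000
</str>
```

Since these are only handler defaults, qf and pf can also be passed as request parameters to try a variation before editing solrconfig.xml. Note that name_zh is declared as text_en_splitting in the posted schema, so with this setup only description_zh would be analyzed by SmartChineseAnalyzer.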
Upvotes: 2