I have set up a Solr environment and am using a text_nl field type, which I populate from several other fields.
I am seeing some odd behavior. Whenever I search for "new", the query returns documents that contain "new", but also some documents that do not contain the string "new" at all. I have already disabled all the filter factories, but to no avail: the query keeps returning documents that do not contain this word.
Below you will find pieces of my solrconfig.xml and schema.xml.
Fieldtype text_nl:
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
    <filter class="solr.ReversedWildcardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
    <!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Field names:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Merk" type="text_nl" indexed="false" stored="true"/>
<field name="Model" type="text_nl" indexed="false" stored="true" multiValued="true" />
<field name="Kleur" type="text_nl" indexed="false" stored="true"/>
<field name="Collectie" type="text_nl" indexed="false" stored="true"/>
<field name="Categorie" type="text_nl" indexed="true" stored="true"/>
<field name="MateriaalSoort" type="text_nl" indexed="false" stored="true"/>
<field name="Zool" type="text_nl" indexed="false" stored="true"/>
<field name="Omschrijving" type="text_nl" indexed="false" stored="true"/>
<field name="text" type="text_nl" indexed="true" stored="true" multiValued="true"/>
solrconfig.xml:
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">50000</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="fl">id,Merk,Model,Kleur,Collectie,Categorie,Zool,Omschrijving</str>
    <str name="qf">Merk^100 Model^0.8 Omschrijving^0.3 id^1.0</str>
    <str name="pf">Merk^100 Model^0.8 Omschrijving^0.3 id^1.0</str>
  </lst>
</requestHandler>
The query /query?q=new yields:
{
  "id":"3215.70.101204",
  "Merk":"New balance",
  "Model":["M576"],
  "Kleur":"Groen",
  "Collectie":"Herenschoenen",
  "Categorie":"Sneakers",
  "Zool":"Rubber",
  "Omschrijving":"Groene nubuck special runner van het merk New Balance. Het logo is van groen nubuck."
},
{
  "id":"3215.26.104592",
  "Merk":"Greve",
  "Model":["6260"],
  "Kleur":"Jeans",
  "Collectie":"Herenschoenen",
  "Categorie":"Sneakers",
  "Zool":"Rubber",
  "Omschrijving":"Deze jeans blauwe suède/lederen runner is van het merk Greve. De runner heeft een merklabel van Greve aan de achterzijde. De runner heeft een witte met houten middenzool en een rubberen zool, verder heeft de runner zilveren studs details."
},
As you can see, the second document does not contain "new" in any of its fields.
This is the result of the debug query:
"debug":{
"rawquerystring":"new",
"querystring":"new",
"parsedquery":"text:new",
"parsedquery_toString":"text:new",
"explain":{
"3215.13.101204":"\n1.4514455 = (MATCH) weight(text:new in 2047) [DefaultSimilarity], result of:\n 1.4514455 = fieldWeight in 2047, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.1875 = fieldNorm(doc=2047)\n",
"3215.30.101204":"\n1.4514455 = (MATCH) weight(text:new in 2142) [DefaultSimilarity], result of:\n 1.4514455 = fieldWeight in 2142, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.1875 = fieldNorm(doc=2142)\n",
"3215.70.101204":"\n1.4514455 = (MATCH) weight(text:new in 2217) [DefaultSimilarity], result of:\n 1.4514455 = fieldWeight in 2217, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.1875 = fieldNorm(doc=2217)\n",
"3215.26.104592":"\n1.3966541 = (MATCH) weight(text:new in 2137) [DefaultSimilarity], result of:\n 1.3966541 = fieldWeight in 2137, product of:\n 2.0 = tf(freq=4.0), with freq of:\n 4.0 = termFreq=4.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.15625 = fieldNorm(doc=2137)\n",
"3215.34.104592":"\n1.3966541 = (MATCH) weight(text:new in 2185) [DefaultSimilarity], result of:\n 1.3966541 = fieldWeight in 2185, product of:\n 2.0 = tf(freq=4.0), with freq of:\n 4.0 = termFreq=4.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.15625 = fieldNorm(doc=2185)\n",
"3215.70.104592":"\n1.3966541 = (MATCH) weight(text:new in 2232) [DefaultSimilarity], result of:\n 1.3966541 = fieldWeight in 2232, product of:\n 2.0 = tf(freq=4.0), with freq of:\n 4.0 = termFreq=4.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.15625 = fieldNorm(doc=2232)\n",
Upvotes: 0
Views: 404
Reputation: 33351
This is probably happening due to the combination of EdgeNGramFilter and ReversedWildcardFilter. EdgeNGramFilter first splits each term into ngrams of size three or more. Each of these is then indexed in both forward and reversed form, so if you index the word "went", you end up with "wen" and "went" plus their reversals, "new" and "tnew".
And so a query for "new" matches the term "went". Because these are edge ngrams, any word beginning with either "new" or "wen" can be expected to match.
Really, I think using both of these is overkill. They are approaches to similar problems, so to my mind they don't make sense used together, and reversing ngrams in particular doesn't make a great deal of sense to me.
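As a rough sketch of what dropping one of the two could look like, here is the index analyzer from the question with only the ReversedWildcardFilterFactory line removed. Everything else is copied from the posted text_nl field type. Which filter to keep is an assumption on my part; it depends on whether you need prefix matching (EdgeNGram) or leading-wildcard queries (ReversedWildcard).

```xml
<!-- Sketch only: the index analyzer from the question, minus ReversedWildcardFilterFactory. -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
  <!-- Keeps prefix grams of 3 to 15 characters; no reversed copies are indexed. -->
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Changes to the index analyzer only affect newly indexed documents, so the existing documents need to be reindexed before the query results change.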
Also, you may have a synonym defined in "synonyms.txt" for the word "new".
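One way to check the synonym theory is to take synonyms out of the picture temporarily. A minimal sketch, assuming the text_nl field type exactly as posted in the question: SynonymFilterFactory is commented out in both analyzers and nothing else is touched, so only one variable changes.

```xml
<!-- Diagnostic sketch: text_nl from the question with synonyms disabled in both analyzers. -->
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
    <filter class="solr.ReversedWildcardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After reindexing, rerun /query?q=new. If the unexpected documents disappear, synonyms.txt is the place to look; if they remain, the cause is elsewhere in the analysis chain.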
Upvotes: 0