Reputation: 125
I'm building an application which uses solr to match longer queries (typically, complete sentences) against indexed documents which are almost always shorter (search terms). So, my query looks like "should I buy a house now while the rates are low. We filed BR 2 yrs ago. Rent now, w/ some sch loan debt" and my indexed documents are like "buy a house", "house loan rates".
I thought that the right way to do this would be to use shingles, the dismax parser, and highly boosted "pf" field. So, I have a "normal" text field, kw_stopped (text_en in solr 3.4) with a very aggressive stopword list, and a kw_phrases field which is meant to be the phrase shingles. Its definition looks like this:
<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="false"/>
</analyzer>
</fieldType>
and my schema fields look like this:
<field name="kw_stopped" type="text_en" indexed="true" omitNorms="True" />
<!-- keywords almost as is - to provide truer match for full phrases -->
<field name="kw_phrases" type="shingle" indexed="true" omitNorms="True" />
My search handler config is this:
<requestHandler name="edismax" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.1</float>
<str name="fl">
keywords
</str>
<str name="mm">1</str>
<str name="qf">
kw_stopped^1.0 kw_phrases^5.0
</str>
<str name="pf">
kw_phrases^50.0
</str>
<int name="ps">3</int>
<int name="qs">3</int>
<str name="q.alt">*:*</str>
</lst>
</requestHandler>
When I turn on debugQuery, I notice that the "kw_phrases" is never matched unless the query and the document are exactly the same. Also the parsedquery shows that the each of the tokenized from the query appear as single DisjunctionMaxQuery clauses for "kw_stopped", but all shingles are put in one giant clause for the kw_phrases field.
Where is the gap in my understanding? How can I make this work?
thanks! Vijay
Upvotes: 2
Views: 3150
Reputation: 52769
If you are using long sentences to search against shorter documents, you seem to be going fine.
Surely, you would need a nice stopwords filter list to prevent general terms matches during both index and search time.
Upvotes: 4