Reputation: 500
We are trying to tune our phrase queries in DSE Search. For example, if column X holds the value "D A T A S T A X", we search for an exact match with X:"T A S T".
Words are tokenized with the whitespace tokenizer.
We have a couple hundred million records in the database and all the indexes are in memory (we verified this using pcstat). However, the queries still take 5-15 seconds. Why does it take so long to return results if all the indexes are in memory? How can I tune this?
Any help is appreciated.
Upvotes: 3
Views: 547
Reputation: 13402
Try this fieldType:
<fieldType name="custom_edge_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Here the KeywordTokenizerFactory passes the whole input through to the filters as a single token. The PatternReplaceFilterFactory removes everything except letters and digits; you can configure this however you want. We then lowercase the token and generate the n-grams. That covers the index phase. In the query phase we skip the NGram filter because we want to match the exact substring.
We use NGram instead of EdgeNGram because NGram produces substrings starting at any position in the token (here between 2 and 15 characters long). EdgeNGram only produces grams anchored at the start or the end of the token, so it is not helpful in this case.
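To make the difference concrete, here is a rough Python sketch that imitates the analysis chain above on the example from the question. It is not Solr code: the helper names `normalize`, `ngrams` and `edge_ngrams` are made up for illustration, and the order in which the real filters emit tokens may differ. It only shows which substrings end up in the index.

```python
import re


def normalize(value):
    """Imitates PatternReplaceFilter + LowerCaseFilter: keep letters/digits, lowercase."""
    return re.sub(r"[^A-Za-z0-9]", "", value).lower()


def ngrams(text, min_size, max_size):
    """All contiguous substrings of text whose length is in [min_size, max_size]."""
    grams = []
    for start in range(len(text)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def edge_ngrams(text, min_size, max_size):
    """Only the substrings anchored at the front edge of text."""
    longest = min(max_size, len(text))
    return [text[:size] for size in range(min_size, longest + 1)]


# Index side: the whole column value becomes one token, then it is n-grammed.
indexed = normalize("D A T A S T A X")
# Query side: same normalization, but no n-gram step.
query = normalize("T A S T")

print("indexed token:", indexed)
print("query token:", query)
print("found via NGram:", query in ngrams(indexed, 2, 15))
print("found via EdgeNGram:", query in edge_ngrams(indexed, 2, 15))
```

Under these assumptions the query term is found among the NGram tokens but not among the EdgeNGram tokens. Note also that a normalized query shorter than minGramSize or longer than maxGramSize has no matching gram in the index, so size those two values to the queries you expect.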
Hope this helps.
Upvotes: 2