Reputation: 370
I have a single-core [1], non-replicating Solr index containing ~40 million documents. Each document has two fields, one stored, the other not. I search on the non-stored field, the stored field being my result.
Response times from this index are around 8 seconds. Something to note is that I'm not making what I consider the typical full-text query. Each query contains dozens of OR terms. I expected this to be slow, but not quite as slow as it is.
Something I notice is that Solr is using only a couple of hundred MB of the 7 GB its JVM has available, so it can't be keeping much of the index in memory. This leads to my question: is there a way to configure Solr so that it is forced to keep much (or at least more) of its index in RAM?
[1] Sharding introduces a problem for me. Relative scores are extremely important in my application of Solr. Shard-local scoring means the more shards I have, the less accurate scores become.
More information in response to comments:
Here's the field type definition for the field I search on:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
and here's an example query:
(Carberry J 2008 Toward a Unified Theory of High-Energy Metaphysics Silly String Theory Journal of Psychoceramics 5 11 1 3)
This will take around 10s to respond, whereas a query with fewer ORed terms, such as (Carberry 2008), will return in ~100ms.
Upvotes: 1
Views: 4943
Reputation: 370
I believe I've found and solved the problem I had.
It turns out that many of my documents, since they are made up of bibliographic metadata, contain some very common words on top of the usual English stop words, such as 'journal' and 'proceedings'. Further, because my documents contain author names, often including initials, many of them contain indexed single-letter terms. If any of these common terms was included in a query, response time would go up by an order of magnitude.
My solution was to simply filter out these common terms using a StopFilter and LengthFilter, like so:
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LengthFilterFactory" min="2" max="100"/>
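For context, here is a minimal sketch of how the query analyzer from the question might look with the two filters added. The placement after LowerCaseFilterFactory, and showing only the query analyzer, are assumptions; the filters could equally be added to the index analyzer as well.

```xml
<analyzer type="query">
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Assumed placement: after lowercasing, so tokens are already normalized -->
  <!-- Drops any token listed in stopwords.txt (e.g. journal, proceedings) -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <!-- Drops tokens shorter than 2 characters, such as author initials -->
  <filter class="solr.LengthFilterFactory" min="2" max="100"/>
</analyzer>
```

stopwords.txt holds one term per line and would need the domain-specific words added to it. Filtering at query time alone keeps those terms out of the search; adding the same filters to the index analyzer also keeps them out of the index, but requires reindexing.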
Upvotes: 5