Reputation: 9084
I have a question that I have not been able to answer myself, despite trying hard.
I think it is a matter of comprehension.
So...
I'm trying to index a long text field (a product description) that can contain duplicate words. Say the text is about a flavour and mentions chocolate, goes on for a while, and then mentions chocolate again.
When Solr indexes the field (as far as I understand the Analysis tab in the Solr admin panel), it creates a term for each token. Terms act as "pointers": each term is associated with the uniqueKey attribute that identifies the "item".
Will the Solr index have two terms pointing to the same item?
This is my text analyzer:
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
I thought the RemoveDuplicatesTokenFilterFactory would delete duplicate entries, but when I looked at the analysis I found this:
As far as I understand Solr, in the end my index is going to contain these three terms pointing to that "item": chocolate, blablabla and chocolate. Is that right?
I hope the question is clear :)
Thanks !
Upvotes: 0
Views: 574
Reputation: 939
What you see in the Analysis tab is the token stream just before the text is indexed into Solr. When the document is actually indexed, Solr stores each term only once and records every occurrence of that term as a (document_id, position) pair.
I hope the example below makes this clearer.
Suppose you want to add the following three documents to Solr:
T[0] = "dark chocolate is the best chocolate"
T[1] = "i love dark chocolate"
T[2] = "chocolate is delicious"
Solr will store them in the inverted index as follows:
"best": {(T[0], position)}
"chocolate": {(T[0], position1), (T[0], position2), (T[1], position), (T[2], position)}
"dark": {(T[0], position), (T[1], position)}
"delicious": {(T[2], position)}
"i": {(T[1], position)}
"is": {(T[0], position), (T[2], position)}
"love": {(T[1], position)}
"the": {(T[0], position)}
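To make the (document_id, position) idea concrete, here is a minimal Python sketch that builds the same kind of mapping for the three documents above. It is only a toy model, not how Solr or Lucene store postings internally: the function name `build_inverted_index` is made up for illustration, plain whitespace splitting stands in for the analyzer chain, and no stop words are removed.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, position) pairs.

    Hypothetical helper for illustration only: tokens are produced by
    lowercasing and splitting on whitespace, which is an assumption
    standing in for a real Solr analyzer.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, token in enumerate(text.lower().split()):
            # The term is a single dictionary key; each occurrence
            # only adds one more (doc_id, position) entry under it.
            index[token].append((doc_id, position))
    return dict(index)


docs = [
    "dark chocolate is the best chocolate",
    "i love dark chocolate",
    "chocolate is delicious",
]

index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

In this sketch "chocolate" is one key whose list holds four entries, two of them for the first document at different positions. That is the sense in which a repeated word gives you one term with several occurrences rather than several terms.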
Note:
Upvotes: 7