avcajaraville

Reputation: 9084

Duplicate terms in Solr index

I have a question that I cannot answer myself, even though I have been trying hard.

I think it is a matter of understanding.

So...

Is the Solr index going to have two terms pointing to the same item?

This is my text analyzer:

<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

I thought RemoveDuplicatesTokenFilterFactory deleted duplicate entries, but when I looked at the analysis I found this:

screenshot

As far as I understand Solr, in the end my index is going to have these three terms pointing to that "item": chocolate, blablabla and chocolate. Is that right?

I hope the question is clear :)

Thanks !

Upvotes: 0

Views: 574

Answers (1)

Aujasvi Chitkara

Reputation: 939

What you see in the Analysis screen is the token stream just before the text is indexed. When Solr actually indexes it, each term is stored only once, along with all occurrences of that term in the form (document_id, position).

Hopefully the example below makes this clearer.

Suppose you want to add the following three documents to Solr:

T[0] = "dark chocolate is the best chocolate"

T[1] = "i love dark chocolate"

T[2] = "chocolate is delicious"

Solr will store them in the inverted index as follows:

"best": {(T[0], position)}

"chocolate": {(T[0], position1), (T[0], position2), (T[1], position), (T[2], position)}

"dark": {(T[0], position), (T[1], position)}

"delicious": {(T[2], position)}

"i": {(T[1], position)}

"is": {(T[0], position), (T[2], position)}

"love": {(T[1], position)}

"the": {(T[0], position)}

Note:

  • position records where the term occurs in the document (Lucene keeps the token position and can optionally keep the start and end character offsets)
  • the term chocolate is stored once in the index, but has two references to document T[0]
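
To make the structure above concrete, here is a minimal Python sketch of an inverted index with positions. It is only an illustration, not how Solr/Lucene is implemented: the function name build_inverted_index is made up for this example, and a plain lowercase-and-split stands in for a real analyzer chain (no stop words, stemming, or normalization).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, position) postings.

    Hypothetical helper for illustration only; a real Solr analyzer
    chain would tokenize and filter the text very differently.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Lowercase + whitespace split is an assumed stand-in for analysis.
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return dict(index)


docs = [
    "dark chocolate is the best chocolate",  # T[0]
    "i love dark chocolate",                 # T[1]
    "chocolate is delicious",                # T[2]
]

index = build_inverted_index(docs)

# Each term is a single key; repeated occurrences become extra postings.
for term in sorted(index):
    print(term, index[term])
```

Here "chocolate" is a single key in the dictionary, and its postings list has two entries for document 0, one for each occurrence.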

Upvotes: 7
