Martin S Ek

Reputation: 2043

Duplicates in Solr index - items added twice or more times

Consider you have a Solr index with approximately 20 million items. When you index these items, they are added to the index in batches.

Approximately 5% of these items are indexed twice or more, causing a duplicates problem.

If you check the log, you can see that these items are indeed added twice (or more), often with an interval of 2-3 minutes between the additions, and with other items indexed in between.

The web server that triggers the indexing is in a load-balanced environment (two web servers). However, the actual indexing is done by a single web server.

Here are some of the config elements in solrconfig.xml:

<indexDefaults>
  .....
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <maxFieldLength>10000</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>10000</commitLockTimeout>

  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <double name="maxMergeMB">1024.0</double>
  </mergePolicy>

<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
I'm using Solr 1.4.1 and Tomcat 7.0.16, together with the latest SolrNet library.

What might be causing this duplicates problem? Thanks for any input!

Upvotes: 3

Views: 6682

Answers (5)

Umar

Reputation: 2849

To answer your question completely, I would need to see the schema. There is a unique id field in the schema that works much like a unique key in a database. Make sure the unique identifier of the document is declared as that unique key; duplicates will then be overwritten, keeping just one document per key.
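
For example, if each item already has a stable identifier of its own (the field name id below is just a hypothetical example), the relevant schema.xml entries would look roughly like this:

<!-- field holding the item's own stable identifier -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- declare it as the unique key, so re-adding the same id replaces the existing document -->
<uniqueKey>id</uniqueKey>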

Upvotes: 6

Martin S Ek

Reputation: 2043

OK, it turned out there were a couple of bugs in the code updating the index. Instead of updating, we always added a new document to the index, even though the item already existed.

The existing document wasn't overwritten because every document in our Solr index gets its own GUID.

Thank you for your answers and time!

Upvotes: 0

Yuriy

Reputation: 1984

It is not possible to have two documents with an identical value in the field marked as the unique id in the schema. Adding two documents with the same value will simply result in the latter one overwriting (replacing) the previous one.

So it sounds like the mistake is on your side and the documents are not really identical.

Make sure your schema and id fields are correct.
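
As an illustration (a sketch assuming the unique key field is called id; the field names are hypothetical), posting these two documents one after the other leaves only a single document in the index, the second one:

<add>
  <doc>
    <field name="id">item-42</field>
    <field name="title">First version</field>
  </doc>
</add>

<add>
  <doc>
    <field name="id">item-42</field>
    <field name="title">Second version, replaces the first one after commit</field>
  </doc>
</add>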

Upvotes: 4

Dorin

Reputation: 2542

To complement what was said above, a solution in this case can be to generate a unique ID for the document in code (or to define one of the existing fields as the unique ID) before sending it to Solr, as sketched below.

This way you make sure that the document you want to update will be overwritten rather than recreated.
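
For example, if the indexing code assigns the item's own database key as the document ID (the field names here are hypothetical), re-sending the same item simply overwrites the earlier version:

<add>
  <doc>
    <!-- id assigned by the indexing code, e.g. the item's database key, not generated by Solr -->
    <field name="id">article-10452</field>
    <field name="title">Updated title</field>
  </doc>
</add>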

Upvotes: 1

Martin S Ek

Reputation: 2043

Actually, all added documents get an auto-generated unique key, through Solr's own uuid field type:

<field name="uid" type="uuid" indexed="true" stored="true" default="NEW"/>
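
(For reference, a uuid fieldType of this kind is typically declared roughly like this in a Solr 1.4 schema; with default="NEW" on the field above, Solr generates a fresh UUID for every document that is added without an explicit value.)

<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>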

So any document added to the index is considered a new one, since it always gets a fresh GUID. However, I think we've got a problem with some other code here: code that adds items to the index when they are updated, instead of just updating them.

I'll be back! Thanks so far!

Upvotes: 0
