rei.cted

Reputation: 11

How do I separate Solr cores so their results don't mix?

I'm trying to set up multiple Solr cores. The data for each core is indexed using Norconex, with each crawl covering an entirely separate site. The schema and solrconfig files are identical for all cores, but each core has its own copy in its conf folder.

When I run a query in the admin UI for core 1, the results also include documents that were indexed into cores 2 and 3. How do I keep them entirely separate? It was my understanding that having separate cores would do this by default.

I've tried clearing all documents from cores 2 and 3, but core 1 still pulls up their docs. Thanks for any help anyone can provide.

Upvotes: 1

Views: 88

Answers (2)

LeeWallen

Reputation: 98

It sounds like cores 1 through 3 may be on the same shard. If so, they are replicas of each other and hold the same data. If core1 were killed and replaced with another core, the data from the other cores would be replicated to the new core as soon as it was added to the collection.

If you want subsets of documents in three separate cores (the physical locations), then those cores need to live in three separate shards (the logical locations). This can be accomplished using routing.

The compositeId router will let you send documents or queries to specific shards. The documentation shows an example of using data from a company field as part of the routing key value like this: "IBM!12345"

The exclamation point is a separator to break the key into the various parts used for creating the shard hash value. This allows sending "IBM" data to one shard, and "YOYODYNE" can be sent to another shard.

If "YOYODYNE" had way more documents than "IBM", then you might want to spread documents for "YOYODYNE" across multiple shards. The documentation says to use something like this:

Another use case could be if the customer "IBM" has a lot of documents and you want to spread it across multiple shards. The syntax for such a use case would be: shard_key/num!document_id where the /num is the number of bits from the shard key to use in the composite hash.

So IBM/3!12345 will take 3 bits from the shard key and 29 bits from the unique doc id, spreading the tenant over 1/8th of the shards in the collection. Likewise if the num value was 2 it would spread the documents across 1/4th the number of shards. At query time, you include the prefix(es) along with the number of bits into your query with the _route_ parameter (e.g., q=solr&_route_=IBM/3!) to direct queries to specific shards.
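To make the key format above concrete, here is a minimal sketch of building compositeId document ids and the matching query parameters. The helper names (route_prefix, composite_id, route_params) are made up for illustration and are not part of Solr or any client library; the only Solr-specific pieces are the "!" separator, the optional "/num" suffix, and the _route_ query parameter.

```python
from urllib.parse import urlencode


def route_prefix(shard_key, bits=None):
    """Return the routing prefix, e.g. 'IBM!' or 'IBM/3!' when bits is given."""
    if bits is None:
        return f"{shard_key}!"
    return f"{shard_key}/{bits}!"


def composite_id(shard_key, doc_id, bits=None):
    """Build a document id for the compositeId router: prefix + unique doc id."""
    return route_prefix(shard_key, bits) + str(doc_id)


def route_params(query, shard_key, bits=None):
    """Build query parameters that restrict a search to the shards for shard_key."""
    return {"q": query, "_route_": route_prefix(shard_key, bits)}


if __name__ == "__main__":
    print(composite_id("IBM", "12345"))           # id with a plain shard key
    print(composite_id("IBM", "12345", bits=3))   # id using 3 bits of the shard key
    # URL-encoded query string to append to the collection's /select URL
    print(urlencode(route_params("solr", "IBM", bits=3)))
```

The same prefix has to be used at index time (in the document id) and at query time (in _route_), which is why both helpers share route_prefix.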

Upvotes: 0

Alexandre Rafalovitch

Reputation: 9789

This should not be happening, so something has gone wrong. Possible causes, from most to least likely:

  1. You are accidentally indexing into a single core (as mentioned in the comments). This is the most likely cause. Perhaps you got the URL wrong, or the software is using some old convention of naming the core through URL parameters. Try to intercept the URLs actually used for indexing and see how they differ when the software thinks it is indexing into different cores. The core name should be in the URL itself (e.g. http://server:8983/solr/core1).
  2. You have created a SolrCloud collection but are trying to index into individual cores of that collection. You should be able to check that in the Admin UI; the core names in a collection are usually quite noticeably specific.
  3. You have created an alias that spans multiple cores and are querying that instead of the individual cores.
  4. You have accidentally pointed several of your cores to the same data directory.

You did not say what happens when you query core2. If it does not have any documents, then the first option is the most likely. If it does, there may be other issues in play.
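A rough way to check options 1 and 4 is to look at the CoreAdmin STATUS output, which reports a document count and a data directory per core. The sketch below is only a starting point: the base URL is an assumed default for a local install, the function names are made up for illustration, and the field names ("status", "dataDir", "index", "numDocs") are the ones the STATUS response normally contains, so verify them against what your Solr version actually returns.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_core_status(base_url="http://localhost:8983/solr"):
    """Fetch CoreAdmin STATUS for all cores (base_url is an assumed default)."""
    with urlopen(f"{base_url}/admin/cores?action=STATUS&wt=json") as resp:
        return json.load(resp)


def summarize_cores(status_response):
    """Return (doc_counts, shared_dirs) from a parsed STATUS response.

    doc_counts maps core name -> numDocs (None if not reported).
    shared_dirs maps dataDir -> sorted core names, only for directories
    used by more than one core.
    """
    doc_counts = {}
    by_data_dir = defaultdict(list)
    for name, info in status_response.get("status", {}).items():
        doc_counts[name] = info.get("index", {}).get("numDocs")
        by_data_dir[info.get("dataDir")].append(name)
    shared_dirs = {
        data_dir: sorted(cores)
        for data_dir, cores in by_data_dir.items()
        if data_dir and len(cores) > 1
    }
    return doc_counts, shared_dirs


if __name__ == "__main__":
    counts, shared = summarize_cores(fetch_core_status())
    print("documents per core:", counts)
    print("data directories shared by several cores:", shared)
```

If core2 and core3 report zero documents while core1 holds everything, the indexer is probably writing to one core. If two cores show up under the same data directory, that points to option 4.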

Upvotes: 1
