Reputation: 1551
I'm trying to index multiple databases into one Solr index. I've been reading the Solr wiki on multiple data sources and experimenting with different kinds of settings, but I'm not able to get the desired result.
My configuration looks like this:
<dataConfig>
  <!-- Has 10000 items -->
  <dataSource name="ds1" driver="org.h2.Driver" url="jdbc:h2:file:/path/to/first" />
  <!-- Has ~7000 items -->
  <dataSource name="ds2" driver="org.h2.Driver" url="jdbc:h2:file:/path/to/second" />
  <document name="myDocName">
    <entity name="firstEntity" rootEntity="true"
            dataSource="ds1" query="SELECT * FROM BLAH"
            transformer="my.Transformer" threads="4">
      ... <!-- field configuration here -->
    </entity>
    <entity name="secondEntity" rootEntity="true"
            dataSource="ds2" query="SELECT * FROM BLAH"
            transformer="my.Transformer" threads="4">
      ... <!-- field configuration here -->
    </entity>
  </document>
</dataConfig>
We're currently working with test data, so I know how many records each database contains: the first has 10000 and the second ~7000. When I start the indexing I get an info message reporting ~17000 adds:
INFO: {deleteByQuery=*:*,add=[5, 1, 2, 6, 7, 4, 8, 3, ... (17069 adds)],commit=} 0
However, when I run the *:* query in the web interface I only get 10000 results (which is exactly the number of items in the largest db). This seems to suggest that 7096 documents contain two entities while the remaining ones contain only one.
I tried having two document elements in the configuration file, but then only one got imported (probably because they have the same name, i.e. document name="myDocName" was configured identically for the two document elements).
At this point I'm stuck and don't know how to properly configure this. The only thing I can additionally think of is that I have to index both databases separately but the workflow for this is not fully clear to me either. Any help would be appreciated.
Update 1: I tried giving both entities different names (something the documentation requires anyhow), but this results in the following behavior. First, documents are added from the first db; next, the first N existing documents are overwritten with documents from the second db, where N is the number of records in the second db. Obviously this is not what I want: I want N additional documents. Adding a second document element in the configuration doesn't seem to work either.
Update 2: According to comments in this bug report: https://issues.apache.org/jira/browse/SOLR-895, root entities in the document tag should result in new documents for these entities. This is not what is happening for me. Explicitly setting rootEntity="true" on each entity tag doesn't change anything either. The result is still that after the import I have only 10000 documents instead of the expected 17000.
Upvotes: 0
Views: 3166
Reputation: 106
I suspect you have unique key conflicts. Do the two databases use the same IDs? If so, documents from the second database overwrite documents from the first that share a key. Try changing the queries to:
- ds1 - "SELECT 'ds1' || id AS id, field1, field2 FROM table1"
- ds2 - "SELECT 'ds2' || id AS id, field1, field2 FROM table2"
I would also remove the multithreading option (threads="4"): it gives no significant performance improvement over the single-threaded case, and it's not really stable (it was removed in the 4.0 release).
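Putting both suggestions together, the configuration from the question might look roughly like the sketch below. It assumes the unique key column is called id in both databases; field1, field2, table1 and table2 are placeholders, not names from the actual schema. The string prefixes are written with single quotes so they are SQL string literals and do not clash with the double-quoted XML attribute.

```xml
<dataConfig>
  <dataSource name="ds1" driver="org.h2.Driver" url="jdbc:h2:file:/path/to/first" />
  <dataSource name="ds2" driver="org.h2.Driver" url="jdbc:h2:file:/path/to/second" />
  <document name="myDocName">
    <!-- Prefix each id with its source so keys from ds1 and ds2 cannot collide.
         Column and table names here are placeholders (assumptions). -->
    <entity name="firstEntity" dataSource="ds1"
            query="SELECT 'ds1' || id AS id, field1, field2 FROM table1"
            transformer="my.Transformer">
      <!-- field configuration here -->
    </entity>
    <entity name="secondEntity" dataSource="ds2"
            query="SELECT 'ds2' || id AS id, field1, field2 FROM table2"
            transformer="my.Transformer">
      <!-- field configuration here -->
    </entity>
  </document>
</dataConfig>
```

Note that the prefixed id is a string, so the uniqueKey field in schema.xml has to be a string type for this to work. If id is numeric and the database does not convert it implicitly during concatenation, an explicit cast may be needed.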
Upvotes: 2