mikołak

Reputation: 9705

Solr DataImportHandler - indexing multiple, related XML documents

Let's say I have two XML document types, A and B, that look like this:

A:

<xml>
    <a>
        <name>First Number</name>
        <num>1</num>
    </a>
    <a>
        <name>Second Number</name>
        <num>2</num>
    </a>
</xml>

B:

<xml>
    <b>
        <aKey>1</aKey>
        <value>one</value>
    </b>
    <b>
        <aKey>2</aKey>
        <value>two</value>
    </b>
</xml>

I'd like to index it like this:

<doc>
    <str name="name">First Number</str>
    <int name="num">1</int>
    <str name="spoken">one</str>
</doc>
<doc>
    <str name="name">Second Number</str>
    <int name="num">2</int>
    <str name="spoken">two</str>
</doc>

So, in effect, I'm trying to use a value from A as a key in B. Using DataImportHandler, I've used the following as my data config definition:

<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
        <entity name="document" transformer="LogTransformer" logLevel="trace"
            processor="FileListEntityProcessor" baseDir="/tmp/somedir"
            fileName="A.*.xml$" recursive="false" rootEntity="false"
            dataSource="null">
            <entity name="a"
                transformer="RegexTransformer,TemplateTransformer,LogTransformer"
                logLevel="trace" processor="XPathEntityProcessor" url="${document.fileAbsolutePath}"
                stream="true" rootEntity="true" forEach="/xml/a">
                <field column="name" xpath="/xml/a/name" />
                <field column="num" xpath="/xml/a/num" />


                <entity name="b" transformer="LogTransformer"
                    processor="XPathEntityProcessor" url="/tmp/somedir/b.xml"
                    stream="false" forEach="/xml/b" logLevel="trace">
                    <field column="spoken" xpath="/xml/b/value[../aKey=${a.num}]" />
                </entity>

            </entity>
        </entity>
    </document>
</dataConfig>

However, I encounter two problems:

  1. I can't get the XPath expression with the predicate to match any rows, regardless of whether I use an alternative form like /xml/b[aKey=${a.num}]/value, or even a hardcoded value for aKey.
  2. Even when I remove the predicate, the parser goes through the B file once for every row in A, which is obviously inefficient.
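For what it's worth, the predicate itself is valid XPath; the catch is that XPathEntityProcessor only supports a limited XPath subset, not arbitrary predicates. A quick standalone sanity check (plain Python, outside DIH; the B data is inlined from the example above) shows a full XPath engine matches it fine:

```python
# Standalone check that a value-by-key predicate works in a general XPath
# engine. DIH's XPathEntityProcessor handles only a restricted subset of
# XPath and cannot evaluate this kind of expression.
import xml.etree.ElementTree as ET

b_xml = """<xml>
    <b><aKey>1</aKey><value>one</value></b>
    <b><aKey>2</aKey><value>two</value></b>
</xml>"""

root = ET.fromstring(b_xml)
# ElementTree supports the [tag='text'] child-text predicate form:
spoken = root.findtext("./b[aKey='2']/value")
```

Here `spoken` comes back as "two", confirming the expression itself is sound; the limitation is in DIH's XPath support, not the predicate.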

My question is: how, in light of the problems listed above, do I index the data correctly and efficiently with the DataImportHandler?

I'm using Solr 3.6.2.

Note: This is a bit similar to this question, but here I'm dealing with two XML document types instead of an RDBMS and an XML document.

Upvotes: 3

Views: 2413

Answers (2)

mikołak

Reputation: 9705

I finally went with another solution, due to an additional design requirement I didn't originally mention. The explanation and discussion follow. So...

If you only have one or a couple of import flow types for your Solr instances:

Then it might be best to go with Achim's answer and develop your own importer - either, as Achim suggests, in your favorite scripting language, or, in Java, using SolrJ's ConcurrentUpdateSolrServer.

This is because the DataImportHandler framework's learning curve spikes sharply once you need to define more complex import flows.

If you have a nontrivial number of different import flows:

Then I would suggest you consider staying with the DataImportHandler, since you will probably end up implementing something similar anyway. And, as the framework is quite modular and extensible, customization isn't a problem.

This is the additional requirement I mentioned, so in the end I went with that route.

I solved my particular quandary by indexing the files I needed to reference into separate cores, and using a modified SolrEntityProcessor to access that data. The modifications were as follows:

  • applying the patch for the sub-entity problem,
  • adding caching (a quick solution using Guava; there's probably a better way using an available Solr API for accessing other cores locally, but I was in a bit of a hurry at that point).
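To illustrate the caching idea with a small sketch (in Python rather than Java/Guava, purely as an analogy; `make_cached_lookup` and `query_core` are hypothetical names, with `query_core` standing in for the per-key request the SolrEntityProcessor issues against the lookup core):

```python
# Analogy for the Guava-based cache: memoise the per-key lookup so that
# repeated keys don't trigger repeated queries against the other core.
from functools import lru_cache

def make_cached_lookup(query_core):
    """Wrap a per-key query function with an in-memory cache."""
    @lru_cache(maxsize=4096)
    def lookup(key):
        # Only reached on a cache miss; the result is stored thereafter.
        return query_core(key)
    return lookup
```

The Guava equivalent would be a CacheBuilder-built loading cache around the sub-entity query; the point in both cases is that each distinct key hits the lookup core at most once per import run.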

If you don't want to create a new core for each file, an alternative would be to extend Achim's idea, i.e. to create a custom EntityProcessor that preloads the data and allows querying it somehow.

Upvotes: 0

Achim

Reputation: 15722

I have had very bad experiences using the DataImportHandler for that kind of data. A simple Python script to merge your data would probably be smaller than your current configuration, and much more readable. Depending on your requirements and data size, you could create a temporary XML file, or you could pipe the results directly to Solr. If you really have to use the DataImportHandler, you could use a URLDataSource and set up a minimal server which generates your XML. Obviously I'm a Python fan, but it's quite likely that this is also an easy job in Ruby, Perl, ...
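A minimal sketch of such a merge script, assuming the file layouts from the question (the function name and the in-memory join are illustrative only; posting the result to Solr's update handler is left out):

```python
# Hedged sketch: join A-style and B-style records outside of
# DataImportHandler, producing Solr-ready documents. Field names follow
# the example in the question.
import xml.etree.ElementTree as ET

def merge_docs(a_xml: str, b_xml: str):
    """Join <a> records with <b> records on num == aKey."""
    # Build a lookup table from B once, so the join over A is linear
    # instead of re-reading B for every A row.
    spoken_by_key = {
        b.findtext("aKey"): b.findtext("value")
        for b in ET.fromstring(b_xml).iter("b")
    }
    docs = []
    for a in ET.fromstring(a_xml).iter("a"):
        num = a.findtext("num")
        docs.append({
            "name": a.findtext("name"),
            "num": int(num),
            "spoken": spoken_by_key.get(num),
        })
    return docs
```

From here the resulting documents can be serialized and sent to Solr's update handler, or written out as a temporary XML file, as described above. Note that this also sidesteps the question's second problem: B is parsed exactly once.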

Upvotes: 2
