mikołak

Reputation: 9705

Solr DataImportHandler - indexing multiple, related XML documents

Let's say I have two XML document types, A and B, that look like this:

A:

<xml>
    <a>
        <name>First Number</name>
        <num>1</num>
    </a>
    <a>
        <name>Second Number</name>
        <num>2</num>
    </a>
</xml>

B:

<xml>
    <b>
        <aKey>1</aKey>
        <value>one</value>
    </b>
    <b>
        <aKey>2</aKey>
        <value>two</value>
    </b>
</xml>

I'd like to index it like this:

<doc>
    <str name="name">First Number</str>
    <int name="num">1</int>
    <str name="spoken">one</str>
</doc>
<doc>
    <str name="name">Second Number</str>
    <int name="num">2</int>
    <str name="spoken">two</str>
</doc>

So, in effect, I'm trying to use a value from A as a key in B. Using DataImportHandler, I've used the following as my data config definition:

<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
        <entity name="document" transformer="LogTransformer" logLevel="trace"
            processor="FileListEntityProcessor" baseDir="/tmp/somedir"
            fileName="A.*.xml$" recursive="false" rootEntity="false"
            dataSource="null">
            <entity name="a"
                transformer="RegexTransformer,TemplateTransformer,LogTransformer"
                logLevel="trace" processor="XPathEntityProcessor" url="${document.fileAbsolutePath}"
                stream="true" rootEntity="true" forEach="/xml/a">
                <field column="name" xpath="/xml/a/name" />
                <field column="num" xpath="/xml/a/num" />


                <entity name="b" transformer="LogTransformer"
                    processor="XPathEntityProcessor" url="/tmp/somedir/b.xml"
                    stream="false" forEach="/xml/b" logLevel="trace">
                    <field column="spoken" xpath="/xml/b/value[../aKey=${a.num}]" />
                </entity>

            </entity>
        </entity>
    </document>
</dataConfig>

However, I encounter two problems:

  1. I can't get the XPath expression with the predicate to match any rows, regardless of whether I use an alternative form like /xml/b[aKey=${a.num}]/value, or even a hardcoded value for aKey.
  2. Even when I remove the predicate, the parser goes through the B file once for every row in A, which is obviously inefficient.
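For what it's worth, the predicate itself is valid XPath; the catch is that XPathEntityProcessor only supports a limited XPath subset, not arbitrary predicates. A quick standalone sanity check (plain Python, outside DIH; the B data is inlined from the example above) shows a full XPath engine matches it fine:

```python
# Standalone check that a value-by-key predicate works in a general XPath
# engine. DIH's XPathEntityProcessor handles only a restricted subset of
# XPath and cannot evaluate this kind of expression.
import xml.etree.ElementTree as ET

b_xml = """<xml>
    <b><aKey>1</aKey><value>one</value></b>
    <b><aKey>2</aKey><value>two</value></b>
</xml>"""

root = ET.fromstring(b_xml)
# ElementTree supports the [tag='text'] child-text predicate form:
spoken = root.findtext("./b[aKey='2']/value")
```

Here `spoken` comes back as "two", confirming the expression itself is sound; the limitation is in DIH's XPath support, not the predicate.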

My question is: how, in light of the problems listed above, do I index the data correctly and efficiently with the DataImportHandler?

I'm using Solr 3.6.2.

Note: This is a bit similar to this question, but here I'm dealing with two XML document types instead of an RDBMS and an XML document.

Upvotes: 3

Views: 2413

Answers (2)

mikołak

Reputation: 9705

I finally went with another solution, due to an additional design requirement I didn't originally mention. The explanation and discussion follow. So...

If you only have one or a couple of import flow types for your Solr instances:

Then it might be best to go with Achim's answer and develop your own importer - either, as Achim suggests, in your favorite scripting language, or, in Java, using SolrJ's ConcurrentUpdateSolrServer.

This is because the DataImportHandler framework's learning curve spikes sharply once you need to define more complex import flows.

If you have a nontrivial number of different import flows:

Then I would suggest you consider staying with the DataImportHandler, since you will probably end up implementing something similar anyway. And, as the framework is quite modular and extensible, customization isn't a problem.

This is the additional requirement I mentioned, so in the end I went with that route.

I solved my particular quandary by indexing the files I needed to reference into separate cores, and using a modified SolrEntityProcessor to access that data. The modifications were as follows:

  • applying the patch for the sub-entity problem,
  • adding caching (a quick solution using Guava; there's probably a better way using an available Solr API for accessing other cores locally, but I was in a bit of a hurry at that point).
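To illustrate the caching idea with a small sketch (in Python rather than Java/Guava, purely as an analogy; `make_cached_lookup` and `query_core` are hypothetical names, with `query_core` standing in for the per-key request the SolrEntityProcessor issues against the lookup core):

```python
# Analogy for the Guava-based cache: memoise the per-key lookup so that
# repeated keys don't trigger repeated queries against the other core.
from functools import lru_cache

def make_cached_lookup(query_core):
    """Wrap a per-key query function with an in-memory cache."""
    @lru_cache(maxsize=4096)
    def lookup(key):
        # Only reached on a cache miss; the result is stored thereafter.
        return query_core(key)
    return lookup
```

The Guava equivalent would be a CacheBuilder-built loading cache around the sub-entity query; the point in both cases is that each distinct key hits the lookup core at most once per import run.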

If you don't want to create a new core for each file, an alternative would be to extend Achim's idea, i.e. to create a custom EntityProcessor that preloads the data and allows querying it somehow.

Upvotes: 0

Achim

Reputation: 15722

I have had very bad experiences using the DataImportHandler for that kind of data. A simple Python script to merge your data would probably be smaller than your current configuration, and much more readable. Depending on your requirements and data size, you could create a temporary XML file, or you could pipe the results directly to Solr. If you really have to use the DataImportHandler, you could use a URLDataSource and set up a minimal server which generates your XML. Obviously I'm a Python fan, but it's quite likely that this is also an easy job in Ruby, Perl, ...
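A minimal sketch of such a merge script, assuming the file layouts from the question (the function name and the in-memory join are illustrative only; posting the result to Solr's update handler is left out):

```python
# Hedged sketch: join A-style and B-style records outside of
# DataImportHandler, producing Solr-ready documents. Field names follow
# the example in the question.
import xml.etree.ElementTree as ET

def merge_docs(a_xml: str, b_xml: str):
    """Join <a> records with <b> records on num == aKey."""
    # Build a lookup table from B once, so the join over A is linear
    # instead of re-reading B for every A row.
    spoken_by_key = {
        b.findtext("aKey"): b.findtext("value")
        for b in ET.fromstring(b_xml).iter("b")
    }
    docs = []
    for a in ET.fromstring(a_xml).iter("a"):
        num = a.findtext("num")
        docs.append({
            "name": a.findtext("name"),
            "num": int(num),
            "spoken": spoken_by_key.get(num),
        })
    return docs
```

From here the resulting documents can be serialized and sent to Solr's update handler, or written out as a temporary XML file, as described above. Note that this also sidesteps the question's second problem: B is parsed exactly once.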

Upvotes: 2
