Reputation: 9705
Let's say I have two XML document types, A and B, that look like this:
A:
<xml>
<a>
<name>First Number</name>
<num>1</num>
</a>
<a>
<name>Second Number</name>
<num>2</num>
</a>
</xml>
B:
<xml>
<b>
<aKey>1</aKey>
<value>one</value>
</b>
<b>
<aKey>2</aKey>
<value>two</value>
</b>
</xml>
I'd like to index it like this:
<doc>
<str name="name">First Number</str>
<int name="num">1</int>
<str name="spoken">one</str>
</doc>
<doc>
<str name="name">Second Number</str>
<int name="num">2</int>
<str name="spoken">two</str>
</doc>
So, in effect, I'm trying to use a value from A as a key in B. Using DataImportHandler, I've used the following as my data config definition:
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="document" transformer="LogTransformer" logLevel="trace"
processor="FileListEntityProcessor" baseDir="/tmp/somedir"
fileName="A.*.xml$" recursive="false" rootEntity="false"
dataSource="null">
<entity name="a"
transformer="RegexTransformer,TemplateTransformer,LogTransformer"
logLevel="trace" processor="XPathEntityProcessor" url="${document.fileAbsolutePath}"
stream="true" rootEntity="true" forEach="/xml/a">
<field column="name" xpath="/xml/a/name" />
<field column="num" xpath="/xml/a/num" />
<entity name="b" transformer="LogTransformer"
processor="XPathEntityProcessor" url="/tmp/somedir/b.xml"
stream="false" forEach="/xml/b" logLevel="trace">
<field column="spoken" xpath="/xml/b/value[../aKey=${a.num}]" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
However, I encounter two problems:
1. The join never happens: the XPath predicate in the spoken field definition doesn't match anything. I get the same (empty) result with /xml/b[aKey=${a.num}]/value, or even with a hardcoded value for aKey.
2. The import is inefficient, since b.xml gets re-read for every <a> element.
My question is: how, in light of the problems listed above, do I index the data correctly and efficiently with the DataImportHandler?
I'm using Solr 3.6.2.
Note: This is a bit similar to this question, but it deals with two XML document types instead of an RDBMS and an XML document.
Upvotes: 3
Views: 2413
Reputation: 9705
I finally went with another solution, due to an additional design requirement I didn't originally mention. What follows is the explanation and discussion.
Then it might be best to go with Achim's answer and develop your own importer - either, as Achim suggests, in your favorite scripting language, or in Java, using SolrJ's ConcurrentUpdateSolrServer.
This is because the DataImportHandler framework has a sudden spike in its learning curve once you need to define more complex import flows.
Then I would suggest you consider staying with the DataImportHandler since you will probably end up implementing something similar anyway. And, as the framework is quite modular and extendable, customization isn't a problem.
This is the additional requirement I mentioned, so in the end I went with that route.
I solved my particular quandary by indexing the files I needed to reference into separate cores and using a modified SolrEntityProcessor to access that data.
If you don't want to create a new core for each file, an alternative would be an extension of Achim's idea, i.e. creating a custom EntityProcessor that would preload the data and enable querying it somehow.
Upvotes: 0
Reputation: 15722
I have had very bad experiences using the DataImportHandler for that kind of data. A simple Python script to merge your data would probably be smaller than your current configuration, and much more readable. Depending on your requirements and data size, you could create a temporary XML file, or you could pipe the results directly to Solr. If you really have to use the DataImportHandler, you could use a URLDataSource and set up a minimal server which generates your XML. Obviously I'm a Python fan, but it's quite likely that it's also an easy job in Ruby, Perl, ...
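To illustrate, here's a minimal sketch of such a merge script using only the standard library. The field names and the num/aKey join come from the question; the output is a Solr update message (an <add> with <doc>/<field> elements) that you could POST to /solr/update or write to a temporary file. How you load the two input files is up to you - the sample strings below just mirror the question's data.

```python
# Sketch of a merge script: joins A-style records with B-style records
# on num == aKey and emits a Solr <add> update document.
import xml.etree.ElementTree as ET

def build_solr_docs(a_xml, b_xml):
    """Join <a> records with <b> records and return Solr update XML."""
    # Preload B into a lookup table, so each join costs O(1)
    # instead of rescanning b.xml for every <a> element.
    spoken_by_key = {
        b.findtext("aKey"): b.findtext("value")
        for b in ET.fromstring(b_xml).iter("b")
    }
    add = ET.Element("add")
    for a in ET.fromstring(a_xml).iter("a"):
        doc = ET.SubElement(add, "doc")
        for name, value in (
            ("name", a.findtext("name")),
            ("num", a.findtext("num")),
            ("spoken", spoken_by_key.get(a.findtext("num"))),
        ):
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")

# Smoke test with the sample data from the question:
a_xml = "<xml><a><name>First Number</name><num>1</num></a></xml>"
b_xml = "<xml><b><aKey>1</aKey><value>one</value></b></xml>"
print(build_solr_docs(a_xml, b_xml))
# -> <add><doc><field name="name">First Number</field>
#    <field name="num">1</field><field name="spoken">one</field></doc></add>
```

Note that Solr's update XML uses <field name="...">...</field> for every field regardless of type; the str/int typing shown in the question is how Solr renders stored fields in query responses, not what you send in.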
Upvotes: 2