Samuel Pérez

Reputation: 55

Need help indexing XML files into Solr using DataImportHandler

I don't know Java, I don't know XML, and I don't know Lucene. Now that that's out of the way: I have been working on a small project using Apache Solr/Lucene. My problem is that I am unable to index the XML files. I think I understand how it's supposed to work, but I could be wrong. I am not sure what information you need in order to help me, so I will just post the code.

<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<!-- This first entity block will read all xml files in baseDir and feed it into the second entity block for handling. -->
<entity name="AMMFdir" rootEntity="false" dataSource="null"
        processor="FileListEntityProcessor"
        fileName="^*\.xml$" recursive="true"
        baseDir="C:\Documents and Settings\saperez\Desktop\Tomcat\apache-tomcat-7.0.23\webapps\solr\data\AMMF_New"
        >
<entity 
        processor="XPathEntityProcessor"
        name="AMMF"
        pk="AcquirerBID"
        datasource="AMMFdir"
        url="${AMMFdir.fileAbsolutePath}"
        forEach="/AMMF/Merchants/Merchant/"
        transformer="DateFormatTransformer, RegexTransformer"
        >

    <field column="AcquirerBID" xpath="/AMMF/Merchants/Merchant/AcquirerBID" />
    <field column="AcquirerName" xpath="/AMMF/Merchants/Merchant/AcquirerName" />
    <field column="AcquirerMerchantID" xpath="/AMMF/Merchants/Merchant/AcquirerMerchantID" />

</entity>
</entity>
</document>

Example XML file:

<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
<Merchants Count="153">
    <Merchant ChangeIndicator="A" LocationCountry="840">
    <AcquirerBID>10029881</AcquirerBID>
    <AcquirerName>WorldPay</AcquirerName>
    <AcquirerMerchantID>*</AcquirerMerchantID>
    <Merchant ChangeIndicator="A" LocationCountry="840">
    <AcquirerBID>10029882</AcquirerBID>
    <AcquirerName>WorldPay2</AcquirerName>
    <AcquirerMerchantID>Hello World!</AcquirerMerchantID>
</Merchant>
</Merchants>

I have this in schema.xml:

<field name="AcquirerBID" type="string" indexed="true" stored="true" required="true" /> 
<field name="AcquirerName" type="string" indexed="true" stored="true" />
<field name="AcquirerMerchantID" type="string" indexed="true" stored="true"/>

I have this in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" default="true" >
<lst name="defaults">
<str name="config">AMMFconfig.xml</str>
</lst>
</requestHandler>

Upvotes: 3

Views: 11434

Answers (3)

Marko Bonaci

Reputation: 5716

To understand how DIH imports XML, I suggest you first carefully read this section of the DIH wiki: http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example.

Open the Slashdot link http://rss.slashdot.org/Slashdot/slashdot in your browser, then right-click the page and select View Source. That is the XML file used in the example. Compare it with the XPathEntityProcessor configuration in the DIH example and you'll see how easy it is to import any XML file into Solr.

If you need more help just ask...

Upvotes: 1

Mark O'Connor

Reputation: 78021

The sample XML is not well-formed, which might explain the errors when indexing the files:

$ xmllint sample.xml
sample.xml:13: parser error : expected '>'
</Merchants>
          ^
sample.xml:14: parser error : Premature end of data in tag Merchants line 3
sample.xml:14: parser error : Premature end of data in tag AMMF line 2
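
If xmllint isn't at hand, a similar check can be sketched with Python's standard library. This is only an illustration: Python and the helper name check_well_formed are my assumptions, and the two fragments are trimmed-down stand-ins for the sample file, not the file itself.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# Stand-in fragments: the first never closes <Merchant>, the second does.
broken = "<Merchants><Merchant><AcquirerBID>1</AcquirerBID></Merchants>"
fixed = "<Merchants><Merchant><AcquirerBID>1</AcquirerBID></Merchant></Merchants>"

print(check_well_formed(broken))  # prints the parser's error message
print(check_well_formed(fixed))   # prints None
```

For a file on disk, ET.parse(path) raises the same ParseError when the file is not well-formed.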

Corrected XML

Here's what I think your sample data should look like (I didn't check it against the XSD file):

<?xml version="1.0" encoding="utf-8"?>
<AMMF xmlns="http://tempuri.org/XMLSchema.xsd" Version="11.2" CreateDate="2011-11-07T17:05:14" ProcessorBINCIB="422443" ProcessorName="WorldPay" FileSequence="18">
  <Merchants Count="153">
    <Merchant ChangeIndicator="A" LocationCountry="840">
      <AcquirerBID>10029881</AcquirerBID>
      <AcquirerName>WorldPay</AcquirerName>
      <AcquirerMerchantID>*</AcquirerMerchantID>
    </Merchant>
    <Merchant ChangeIndicator="A" LocationCountry="840">
      <AcquirerBID>10029882</AcquirerBID>
      <AcquirerName>WorldPay2</AcquirerName>
      <AcquirerMerchantID>Hello World!</AcquirerMerchantID>
    </Merchant>
  </Merchants>
</AMMF>

Alternative solution

I know you said you're not a programmer, but this task is significantly simpler if you use the SolrJ interface.

The following is a Groovy example that indexes your example XML:

//
// Dependencies
// ============
import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument

@Grapes([
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
])

//
// Main
// =====

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");
def i = 1

new File(".").eachFileMatch(~/.*\.xml/) { 

    it.withReader { reader ->
        def ammf = new XmlSlurper().parse(reader)

        ammf.Merchants.Merchant.each { merchant ->
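            SolrInputDocument doc = new SolrInputDocument()

            // Field names assume a schema with an "id" key and "*_s" dynamic
            // string fields, as in Solr's example schema. text() passes plain
            // strings rather than XmlSlurper node objects.
            doc.addField("id",           i++)
            doc.addField("bid_s",        merchant.AcquirerBID.text())
            doc.addField("name_s",       merchant.AcquirerName.text())
            doc.addField("merchantId_s", merchant.AcquirerMerchantID.text())

            server.add(doc)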
        }
    }

}

server.commit()

Groovy is a scripting language for the Java platform, and scripts run without a separate compilation step. It would be just as easy to maintain as a DIH config file.

Upvotes: 2

mlissner

Reputation: 18206

Often the best thing to do is NOT use the DIH. How hard would it be to just post this data using the API and a custom script in a language you DO know?

The benefit of this approach is twofold:

  1. You learn more about your system, and know it better.
  2. You don't spend time trying to understand the DIH.

The downside is that you're reinventing the wheel a bit, but the DIH takes real effort to understand.
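
As a rough sketch of what such a script could look like, here is one in Python using only the standard library. Treat everything in it as an assumption rather than a recipe: the choice of Python, the helper names build_update_message and post_to_solr, and the update URL (adjust host, port, and path to your Tomcat setup). The field names are the ones from the schema in the question, and the namespace is the xmlns declared on the AMMF root element.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Default namespace declared on the <AMMF> root element of the sample file.
NS = {"a": "http://tempuri.org/XMLSchema.xsd"}
FIELDS = ("AcquirerBID", "AcquirerName", "AcquirerMerchantID")


def build_update_message(xml_bytes):
    """Turn one AMMF file into a Solr <add> message, one <doc> per Merchant."""
    root = ET.fromstring(xml_bytes)
    add = ET.Element("add")
    for merchant in root.findall("a:Merchants/a:Merchant", NS):
        doc = ET.SubElement(add, "doc")
        for name in FIELDS:
            value = merchant.findtext("a:" + name, default=None, namespaces=NS)
            if value is not None:  # skip fields the merchant does not have
                field = ET.SubElement(doc, "field", name=name)
                field.text = value
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(message, update_url="http://localhost:8983/solr/update?commit=true"):
    """POST the message to Solr's XML update handler (the URL is an assumption)."""
    request = urllib.request.Request(
        update_url, data=message, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Trimmed-down sample in the same shape as the file in the question.
sample = b"""<AMMF xmlns="http://tempuri.org/XMLSchema.xsd">
  <Merchants>
    <Merchant><AcquirerBID>10029881</AcquirerBID><AcquirerName>WorldPay</AcquirerName></Merchant>
  </Merchants>
</AMMF>"""

message = build_update_message(sample)
print(message.decode("utf-8"))
# post_to_solr(message)  # uncomment once Solr is running at the assumed URL
```

Building the message separately from sending it means you can inspect exactly what would be posted before Solr is involved at all.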

Upvotes: 0
