DavidVdd

Reputation: 1018

Adding entities to Solr using SolrJ and schema.xml

I would like to add entities to documents, as you can with the data-config of the DataImportHandler. At the moment I index every page of each document as a separate Solr document.

Now:

<solrDoc>
<id>1</id>
<docname>test.pdf</docname>
<pagenumber>1</pagenumber>
<pagecontent>blablabla</pagecontent>
</solrDoc>

<solrDoc>
<id>2</id>
<docname>test.pdf</docname>
<pagenumber>2</pagenumber>
<pagecontent>blablabla</pagecontent>
</solrDoc>

As you can see, the data describing the document itself is stored once per page, so x times for a document with x pages. I would like to get documents like this:

<doc>
<id>1</id>
<docname>test.pdf</docname>
<pageEntries> <!-- multivalued field -->
<pageEntry><pagenumber>1</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
<pageEntry><pagenumber>2</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
</pageEntries>
</doc>

I don't know how to create something like pageEntry. I saw that Solr can import entities from databases, but how can I do the same (or something similar) from SolrJ?

I'm using Solr 3.6.1. I do the page extraction myself using PDFBox.

Java code:

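SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("id", 1);
solrDoc.setField("filename", "test");
// addField appends a value to the field instead of replacing it,
// so "page" and "pagecont" end up holding one value per page.
for (int p : pages) {
    solrDoc.addField("page", p);
}
for (String pc : pagecont) {
    solrDoc.addField("pagecont", pc);
}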

Upvotes: 0

Views: 613

Answers (1)

Jayendra

Reputation: 52769

If you perform the extraction yourself, you can combine all the pages and feed them as a single Solr document, with pagenumber and pagecontent as multivalued fields.
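One thing to keep in mind with this layout: as far as I know, multivalued fields are flat lists with no link between them, so a page number and its content are related only by their position. Here is a minimal plain-Java sketch of keeping the two lists aligned. It does not use SolrJ, and the class and method names (PagedDocSketch, build, contentOf) are hypothetical, made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of SolrJ: models one document whose
// multivalued fields "pagenumber" and "pagecontent" are parallel lists.
class PagedDocSketch {

    // Builds the field map; entry i of "pagenumber" belongs to entry i of "pagecontent".
    static Map<String, List<Object>> build(String id, String docname, List<String> pageTexts) {
        Map<String, List<Object>> doc = new LinkedHashMap<String, List<Object>>();
        doc.put("id", single(id));
        doc.put("docname", single(docname));

        List<Object> numbers = new ArrayList<Object>();
        List<Object> contents = new ArrayList<Object>();
        for (int i = 0; i < pageTexts.size(); i++) {
            numbers.add(i + 1);              // page numbers are 1-based
            contents.add(pageTexts.get(i));  // added in the same order
        }
        doc.put("pagenumber", numbers);
        doc.put("pagecontent", contents);
        return doc;
    }

    // Wraps a single value in a list so every field has the same shape.
    private static List<Object> single(Object value) {
        List<Object> list = new ArrayList<Object>();
        list.add(value);
        return list;
    }

    // Returns the content stored at the same position as pageNumber, or null if absent.
    static String contentOf(Map<String, List<Object>> doc, int pageNumber) {
        int idx = doc.get("pagenumber").indexOf(pageNumber);
        return idx < 0 ? null : (String) doc.get("pagecontent").get(idx);
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<String>();
        pages.add("first page text");
        pages.add("second page text");
        Map<String, List<Object>> doc = build("1", "test.pdf", pages);
        System.out.println(contentOf(doc, 2));
    }
}
```

With SolrJ the same idea would mean calling addField for the page number and the page content inside one loop, so both fields always receive their values in the same order.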

Alternatively, you can keep one Solr document per page, give all pages of a document the same id (provided the id is not the uniqueKey in the schema definition), and use Grouping (Field Collapsing) to group the results per document.
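To make the grouping option concrete, here is a rough sketch of the request parameters involved. group, group.field and group.limit are Solr's result grouping parameters; the class name GroupingQuerySketch, the method buildQuery, and the choice of id as the grouping field are assumptions for illustration. It only builds the query string with plain Java and does not talk to Solr.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the query string for a grouped Solr search.
class GroupingQuerySketch {

    // userQuery: the normal q parameter; groupField: the shared per-document field;
    // pagesPerDoc: how many matching pages to return inside each group.
    static String buildQuery(String userQuery, String groupField, int pagesPerDoc) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", userQuery);
        params.put("group", "true");            // switch result grouping on
        params.put("group.field", groupField);  // collapse pages sharing this value
        params.put("group.limit", String.valueOf(pagesPerDoc));

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // URL-encodes one key or value as UTF-8.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("pagecontent:blablabla", "id", 3));
    }
}
```

The resulting string would be appended to the select handler URL of your core. Each group in the response then stands for one PDF, and the documents inside it are the matching pages.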

Upvotes: 1
