Vicky

Reputation: 17375

Configuring SOLR app for indexing pdf documents

I am completely new to Apache SOLR/Lucene but want to use it for indexing PDF documents.

I have started learning by following the official tutorial:

[Apache SOLR 4.6.0 Tutorial][1]

I am able to reach the point in the tutorial with heading "Indexing Data" where they index two .xml files.

However, I am not able to follow anything after the lines below in that section, or any of the sections after that.

You have now indexed two documents in Solr, and committed these changes. You can now search for "solr" by loading the "Query" tab in the Admin interface, and entering "solr" in the "q" text box. Clicking the "Execute Query" button should display the following URL containing one result... 

It's too confusing, with too little information.

Can anyone please point me to a basic tutorial on SOLR that teaches how to configure SOLR and then index .pdf documents?

From the tutorial it seems that Solr Cell (ExtractingRequestHandler) is the way to go. But what it is, and how to use it with the setup I made by following the steps in the tutorial, is what I do not understand.

There are some questions on Stack Overflow about PDF indexing with SOLR as well, but they are either too specific or the answers are too high level for my understanding. I need a basic step-by-step tutorial for PDF indexing with SOLR.

Thanks for reading!

Upvotes: 0

Views: 1154

Answers (1)

Rahul Shardha

Reputation: 399

To start with, you should look at how Solr actually works.

The following is not literally accurate, but it is a close analogy (read "->" as "can be translated as"):

A core in Solr -> a table in SQL

A document in Solr -> a record in the table

A document can have any number of fields (like columns in a table): ID, NAME, EMAIL, etc.

A field has a type (like a variable; the types come from Lucene's classes: String, UUID, etc.). A field can be indexed (searchable) and stored (retrievable as is).

Now you have to decide what implementation you want. A single core (table) implementation is the easiest, but for almost all Solr use cases you'll want a multicore setup.

In the Solr 4.6.0 directory you downloaded, browse to example and run start.jar with the following command: java -Dsolr.solr.home=multicore -jar start.jar

Open up http://localhost:8983/solr and browse around; you'll learn a lot by observation.

Next go to the multicore directory under example.

You will see a solr.xml file. Open it. At the bottom you will find the definitions of the cores. Add a core entry for YOUR_CORE_NAME, modeled on the entries that are already there.

Once you have that, save the file and run Solr. You'll see a bunch of errors along the lines of: cannot find solrconfig.xml, schema.xml for YOUR_CORE_NAME.

These files are important because:

solrconfig.xml: defines how your core (table) behaves while Solr is running. It is extremely customizable and extremely useful, but too much for someone just starting with Solr (you learn it on the fly). For now, copy a solrconfig.xml from one of the other cores.

schema.xml: This is like your table definition. This is where you define your "fields" (columns). Take a look at the schema for the other cores and read

http://wiki.apache.org/solr/SchemaXml

Make a simple schema with 3 fields. Pay close attention to analyzers; for now use the Lucene standard analyzer. It's extremely good and works for most use cases.
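A simple three-field schema along those lines might look roughly like the sketch below, written out by a small shell helper. Everything specific here is an assumption for illustration: the helper name write_minimal_schema, the field names title and content, and the choice of a StandardTokenizerFactory plus lowercase filter chain as a stand-in for the standard analyzer. The extra _version_ field is included in case the solrconfig.xml you copy enables the update log, which requires that field.

```shell
# Hypothetical helper: writes a minimal Solr 4.x style schema.xml to the path given as $1.
# Field names (title, content) are placeholders, not taken from the tutorial.
write_minimal_schema() {
  cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="YOUR_CORE_NAME" version="1.5">
  <types>
    <!-- Untokenized string, used for the unique key -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <!-- Numeric type for the internal _version_ field -->
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <!-- Tokenized text: standard tokenizer plus lowercasing -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="content" type="text_general" indexed="true" stored="false"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
EOF
}
```

Calling write_minimal_schema ./schema.xml would produce a starting file to edit by hand. Here content is indexed but not stored, so it can be searched but is not returned in results; flip stored to true if you want the text back.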

Now the directory structure: Inside multicore, make a folder named YOUR_CORE_NAME.

Under YOUR_CORE_NAME: make a conf folder and place your solrconfig.xml and schema.xml inside this folder.
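The directory layout described above can be sketched as a small shell helper. The function name make_core and its arguments are made up for this example, and the default template core name core0 assumes the multicore example ships a core by that name; pass a different name as the fourth argument if yours differs.

```shell
# Hypothetical helper: make_core MULTICORE_DIR CORE_NAME SCHEMA_FILE [TEMPLATE_CORE]
# Creates MULTICORE_DIR/CORE_NAME/conf, copies solrconfig.xml from an existing
# core (assumed default: core0), and installs your schema file as schema.xml.
make_core() {
  multicore_dir=$1
  core_name=$2
  schema_file=$3
  template_core=${4:-core0}
  conf_dir="$multicore_dir/$core_name/conf"

  # Create the core folder and its conf subfolder in one step
  mkdir -p "$conf_dir" || return 1

  # Reuse the solrconfig.xml of an existing core as a starting point
  cp "$multicore_dir/$template_core/conf/solrconfig.xml" "$conf_dir/solrconfig.xml" || return 1

  # Place your own schema next to it
  cp "$schema_file" "$conf_dir/schema.xml" || return 1
}
```

For instance, running make_core example/multicore YOUR_CORE_NAME ./schema.xml from the Solr 4.6.0 directory would leave both files under example/multicore/YOUR_CORE_NAME/conf.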

Start Solr. It should now boot up without any errors.

Once you have that, keep tweaking the schema.xml until you come up with what you're looking for.

Upvotes: 1
