How to exclude text indexed from PDFs in a Solr query

I have a Solr index generated from a catalog of PDF files and corresponding metadata fields describing those files. I would like to give my users an option to exclude from the query any text indexed from within a PDF, so that the results are based on the metadata fields only and are not biased by the large amount of text inside the PDF files.

I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.

Is there another way?

Upvotes: 0

Views: 202

Answers (3)

Anand

Reputation: 81

You can look at field aliases.

If you have two index fields

  • pdfmeta
  • pdftext

Then you can create two field aliases

  • quicksearch : pdfmeta
  • fullsearch : pdfmeta, pdftext

One advantage of using a field alias over a plain qf is that if your users have bookmarks like q=quicksearch:value, you can change which fields quicksearch maps to without breaking those bookmarks.
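The answer does not say how the aliases are defined. One option, assuming the eDisMax parser is in use, is its per-field qf override (f.<alias>.qf). A minimal sketch that only builds the query string; the search term "invoice" is a placeholder, and the same f.*.qf values could instead be set as defaults in a request handler so clients never send them:

```python
from urllib.parse import urlencode, parse_qs


def alias_query_string(user_query):
    """Build eDisMax parameters that define both aliases on the request."""
    params = {
        "defType": "edismax",
        "q": user_query,
        # Each alias is a virtual field that expands to the listed real fields.
        "f.quicksearch.qf": "pdfmeta",
        "f.fullsearch.qf": "pdfmeta pdftext",
    }
    return urlencode(params)


# "invoice" is a hypothetical search term.
query_string = alias_query_string("quicksearch:invoice")
print(query_string)
```

Append the resulting string to your core's select URL. Switching the bookmark between quicksearch: and fullsearch: is then the only change a user needs to make.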

Upvotes: 0

Alexandre Rafalovitch

Reputation: 9789

Sounds like you are doing a general search against a default field, which means you have a lot of copyField instructions (or just one copyField * -> text) that include the PDF content field.

You can create a second destination and copyField everything but the PDF content field into that as well. This way, users can search against one or the other combined field.

However, remember that copyField parses all content according to the analysis chain of the destination field, so eDisMax with a list of source fields may be a better approach there. And, remember, you can define several request handlers (like 'select'), each with different default parameters. That usually makes the client code a bit easier.
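As a rough sketch of the second copyField destination, the snippet below builds a JSON payload in the shape the Solr Schema API accepts (add-field and add-copy-field commands). Everything named here is an assumption for illustration: the destination name text_meta, the text_general field type, and the metadata field names title, author and keywords. Nothing is sent to a server; the payload is only printed.

```python
import json


def metadata_only_copy_fields(metadata_fields, dest="text_meta"):
    """Return a Schema API style payload for a catch-all field without PDF content."""
    return {
        "add-field": {
            "name": dest,
            "type": "text_general",  # assumed field type; use one from your schema
            "indexed": True,
            "stored": False,
            "multiValued": True,
        },
        # One copyField rule per metadata field; the PDF content field is left out.
        "add-copy-field": [
            {"source": name, "dest": dest} for name in metadata_fields
        ],
    }


# Hypothetical metadata field names.
payload = metadata_only_copy_fields(["title", "author", "keywords"])
print(json.dumps(payload, indent=2))
```

If you go this way, the payload would be POSTed to the core's schema endpoint, and existing documents need to be reindexed because copyField is applied at index time.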

Upvotes: 1

Binoy Dalal

Reputation: 896

You do not need to use 2 separate indexes. You can use the edismax parser and specify the qf parameter at query time; qf determines which fields are searched.
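A minimal sketch of switching qf per request, assuming the metadata lives in a field called pdfmeta and the extracted PDF text in pdftext (names borrowed from another answer, not from your schema), with a hypothetical include_pdf_text flag driven by the user's checkbox:

```python
from urllib.parse import urlencode

# Assumed field names; replace with the ones in your schema.
METADATA_FIELDS = ["pdfmeta"]
PDF_TEXT_FIELD = "pdftext"


def search_params(user_query, include_pdf_text):
    """Return eDisMax parameters, adding the PDF text field only on request."""
    fields = list(METADATA_FIELDS)
    if include_pdf_text:
        fields.append(PDF_TEXT_FIELD)
    return {"defType": "edismax", "q": user_query, "qf": " ".join(fields)}


# Metadata-only search; "annual report" is a placeholder query.
print(urlencode(search_params("annual report", include_pdf_text=False)))
```

The index stays the same in both cases; only the list of fields in qf changes between a metadata-only and a full search.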

Upvotes: 0
