Saqib Iqbal

Reputation: 349

Extracting PDF from Apache Solr

I am new to Solr indexing. I am using Solr 5.5 and indexed a PDF file in it by simply running

#bin/post -c gettingstarted /home/ubuntu/pdf.pdf

I have since deleted the source PDF file. Is there any way I can extract the PDF file from Apache Solr? I can see that it is indexed when I open the URL

http://localhost:8983/solr/gettingstarted/select?q=*.pdf

Thanks in advance.

Upvotes: 0

Views: 990

Answers (1)

Vinod

Reputation: 1953

If the file was indexed properly, the text extracted from the PDF is by default indexed into a field named content, provided that field is declared correctly in the schema. So search for a keyword (or *) against the content field.

Ex: q=content:keyword (where keyword is a term that is present in the PDF)

http://localhost:8983/solr/gettingstarted/select?q=content:*
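
The same query can be run from a script. The following is only a minimal sketch: the helper names build_select_url and fetch_docs are made up for illustration, and it assumes the default local Solr URL and the gettingstarted collection from the question. Note that Solr can only return the content field if it is stored, and what comes back is the extracted text, not the original PDF file.

```python
import json
import urllib.parse
import urllib.request


def build_select_url(base_url, collection, query, fields="id,content", rows=10):
    """Build a /select URL (hypothetical helper) for the given collection and query."""
    # urlencode escapes characters such as ':' and '*' in the query string.
    params = urllib.parse.urlencode(
        {"q": query, "fl": fields, "rows": rows, "wt": "json"}
    )
    return f"{base_url}/solr/{collection}/select?{params}"


def fetch_docs(url):
    """Run the query against a running Solr and return the matching documents."""
    with urllib.request.urlopen(url) as response:
        body = json.load(response)
    # With wt=json, the matching documents are under response -> docs.
    return body["response"]["docs"]
```

With Solr running, something like fetch_docs(build_select_url("http://localhost:8983", "gettingstarted", "content:*")) should return the stored fields of each matching document.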

If the content field is undefined, add a field definition to the schema file.

Ex: Field name declaration

<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>

Field type definition

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
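
If the collection uses a managed schema rather than a hand-edited schema file (an assumption here, since the question does not say which), the same field could instead be added through Solr's Schema API. The sketch below only builds the request; add_field_request is a hypothetical helper name, and the URL and collection are taken from the question.

```python
import json
import urllib.request


def add_field_request(base_url, collection, name="content", field_type="text_general"):
    """Build (but do not send) a Schema API request that adds an indexed, stored field."""
    # Mirrors the <field .../> declaration above as an "add-field" command.
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
            "multiValued": True,
        }
    }
    return urllib.request.Request(
        f"{base_url}/solr/{collection}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would send it to a running Solr. Documents indexed before the field existed would need to be indexed again for the field to be populated.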

Upvotes: 1
