Searching text in pdf using php

Question

I have a large database with roughly 5 lakh (500K) entries. Each entry has a document associated with it (i.e. every id has at least one PDF file). I need a robust method to search for a particular text in those PDF files and, if it is found, return the respective 'id'.

Kindly share some fast and optimized ways to search for text in a PDF using PHP. Any ideas will be appreciated.

Note: Converting the PDFs to text and then searching is not what I am looking for, since it would obviously take longer.

In one line: I need the best way to search for text in PDF files using PHP.

Gishas · Accepted Answer

I wrote a website in ReactJS to search for information in PDF files (books), which I indexed using the Apache Solr search engine.

What I did in React is, in essence:

    // Combine the search terms into one OR group, e.g. "(chocolate OR cake)"
    queryValue = "(" + queryValueTerms.join(" OR ") + ")"

    let query = "http://localhost:8983/solr/richText/select?q="
    let queryElements = []

    // Restrict the search to the "text" field
    if (searchValue) {
      queryElements.push("text:" + queryValue)
    }

    ...

    // Send the request and read the matches from Solr's JSON response
    fetch(query)
      .then(res => res.json())
      .then((result) => {
        setSearchResults(prepareResults(result.response.docs, result.highlighting))
        setTotal(result.response.numFound)
        setHasContent(result.response.numFound > 0)
      })

This results in an HTTP call:

http://localhost:8983/solr/richText/select?q=text:(chocolate%20OR%20cake)
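To make the URL construction concrete, here is a minimal sketch of how such a query URL could be assembled with explicit encoding. The helper name buildSolrQueryUrl is hypothetical and not part of the original site; the core name (richText) and field name (text) are the ones from the call above, and the terms are assumed to be plain words without Solr special characters.

```javascript
// Hypothetical helper: builds a Solr select URL that ORs the given terms
// against the "text" field (core and field names are taken from the answer).
const SOLR_SELECT_URL = "http://localhost:8983/solr/richText/select";

function buildSolrQueryUrl(terms) {
  // Drop empty entries so the query never contains a dangling "OR".
  const cleaned = terms.map((t) => t.trim()).filter((t) => t.length > 0);
  if (cleaned.length === 0) {
    return null; // nothing to search for
  }
  const q = "text:(" + cleaned.join(" OR ") + ")";
  // encodeURIComponent escapes spaces and ':' but leaves parentheses as-is.
  return SOLR_SELECT_URL + "?q=" + encodeURIComponent(q);
}

console.log(buildSolrQueryUrl(["chocolate", "cake"]));
```

With explicit encoding the colon is sent as %3A, which is equivalent on the server side to the unescaped form shown above.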

Since this is ReactJS and only fragments of the code, it is of little direct value to you in terms of PHP; I just wanted to demonstrate the approach. In PHP you would presumably make the same HTTP request with cURL or a similar client.
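Since the question asks for the matching 'id', the following is a rough sketch of pulling ids out of the parsed JSON response. It assumes each PDF was indexed with its database id stored in a field named id, which the original setup does not confirm; extractIds is a hypothetical helper, and the sample object is hand-written to mirror the shape the fetch code reads (response.docs, response.numFound), not real Solr output.

```javascript
// Hypothetical helper: collects the ids from a parsed Solr select response.
// Assumes every document was indexed with its database id in an "id" field.
function extractIds(result) {
  // Fall back to an empty list if the response has no docs array.
  const docs = (result && result.response && result.response.docs) || [];
  return docs.map((doc) => doc.id);
}

// Hand-written sample in the shape the fetch code reads; not real Solr output.
const sample = {
  response: { numFound: 2, docs: [{ id: "17" }, { id: "42" }] },
};

console.log(extractIds(sample));
```

The same mapping carries over to PHP once the response body has been decoded from JSON.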

I did the indexing itself in a separate service using SolrJ, i.e. I wrote a rather small Java program that uses Solr's own SolrJ library to add PDF files to the Solr index.

If you opt for indexing with Java and SolrJ (it was the easiest option for me, even though I hadn't written Java in years), here are some useful resources and examples that I collected after an extensive search for my own purposes:

https://solr.apache.org/guide/8_5/using-solrj.html#using-solrj

I basically copied what's here: https://lucidworks.com/post/indexing-with-solrj/ and tweaked it for my needs.

Tip: Since I was very rusty with Java, instead of setting up classpaths etc., the quick solution for me was to copy ALL the libraries from Solr's solrj folder into my Java project, and possibly some other libraries as well. It may be ugly, but it did the job for me.
