Rahul Kumar Jha

Reputation: 359

Searching text in pdf using php

I have a large database with roughly 5 lakh (500K) entries. Each entry has some document associated with it (i.e. every id has at least one PDF file). I need a robust method to search for a particular text in those PDF files and, if it is found, return the respective 'id'.

Kindly share some fast and optimized ways to search for text in a PDF using PHP. Any ideas will be appreciated.

Note: Converting the PDF to text and then searching is not what I am looking for; obviously, it will take longer.

In one line: I need the best way to search for text in PDF files using PHP.

Upvotes: 3

Views: 1254

Answers (2)

Rick James

Reputation: 142208

If this is a one-time task, there is probably no 'fast' solution.

If this is a recurring task,

  1. Extract the text via some tool. (Sorry, I don't know of a tool.)
  2. Store that text in a database table.
  3. Apply a FULLTEXT index to that table.

Now the searching will be fast.
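As a rough sketch of steps 2 and 3, the table and the search query might look like the following. It is written in JavaScript only to match the other code in this thread; the same SQL could be run from PHP. The table and column names (`pdf_text`, `id`, `body`) and the helper `buildSearch` are hypothetical, and the text extraction in step 1 is left to whichever tool you pick.

```javascript
// Hypothetical schema: one row per document id, holding the extracted PDF text.
// The FULLTEXT index on `body` lets MySQL search the text without scanning every row.
const createTableSql = `
CREATE TABLE pdf_text (
  id   INT UNSIGNED NOT NULL PRIMARY KEY,
  body MEDIUMTEXT   NOT NULL,
  FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB`;

// Build a parameterized search; the placeholder keeps user input out of the SQL string.
function buildSearch(text) {
  return {
    sql: "SELECT id FROM pdf_text WHERE MATCH(body) AGAINST (? IN NATURAL LANGUAGE MODE)",
    params: [text],
  };
}

const search = buildSearch("chocolate cake");
console.log(createTableSql);
console.log(search.sql, search.params);
```

The query returns the ids of the matching rows, which is what the question asks for; the expensive extraction happens once per document rather than once per search.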

Upvotes: 2

Gishas

Reputation: 555

I wrote a website in ReactJS to search for information in PDF files (indexed books), which I indexed using the Apache Solr search engine.

What I did in React is, in essence:

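    // Combine the search terms into one OR group, e.g. "(chocolate OR cake)"
    const queryValue = "(" + queryValueTerms.join(" OR ") + ")"

    let query = "http://localhost:8983/solr/richText/select?q="
    let queryElements = []

    // Search the indexed "text" field only when the user entered something
    if (searchValue) {
      queryElements.push("text:" + queryValue)
    }

    ...

    // Send the finished query to Solr and unpack the JSON response
    fetch(query)
      .then(res => res.json())
      .then((result) => {
        setSearchResults(prepareResults(result.response.docs, result.highlighting))
        setTotal(result.response.numFound)
        setHasContent(result.response.numFound > 0)
      })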

This results in an HTTP call:

http://localhost:8983/solr/richText/select?q=text:(chocolate%20OR%20cake)

Since this is ReactJS and only fragments of the code, it is of little direct value to you in terms of PHP; I just wanted to demonstrate the approach. I guess you'd be using cURL or something similar.
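To make the approach easier to port, here is a self-contained sketch of the two pieces that matter: building the select URL and pulling ids out of the response. The function names `buildSolrQuery` and `extractIds` are made up for illustration, and the sketch assumes each indexed document carries an `id` field matching the database id. Unlike the hand-written URL above, it percent-encodes the whole query, so `:` appears as `%3A`.

```javascript
// Build the Solr select URL for an OR search over the "text" field.
// baseUrl defaults to the core used in this answer; adjust it for your setup.
function buildSolrQuery(terms, baseUrl = "http://localhost:8983/solr/richText/select") {
  const queryValue = "(" + terms.join(" OR ") + ")";
  // encodeURIComponent escapes spaces and ":" so the query survives the URL
  return baseUrl + "?q=" + encodeURIComponent("text:" + queryValue);
}

// Pull the ids out of a Solr JSON response (assumes every doc has an "id" field).
function extractIds(result) {
  return result.response.docs.map((doc) => doc.id);
}

const url = buildSolrQuery(["chocolate", "cake"]);
// A hand-made response object standing in for what fetch(url) would return
const ids = extractIds({ response: { numFound: 2, docs: [{ id: "17" }, { id: "42" }] } });
console.log(url, ids);
```

Note that the terms are not escaped for Solr's query syntax here, so user input containing characters such as `:` or `"` would need extra handling.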

I did the indexing itself in a separate service using SolrJ, i.e. I wrote a rather small Java program that uses Solr's own SolrJ library to add PDF files to the Solr index.
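If you would rather not write Java, a different route (not the one I used) is to post each PDF to Solr's extracting request handler over HTTP. The sketch below assumes that handler is enabled on the core at `/update/extract`; the function names `buildExtractUrl` and `indexPdf` are hypothetical, and the caller is expected to read the file into memory first.

```javascript
// Build the URL for Solr's extracting request handler (Solr Cell).
// literal.id stores the database id with the document; commit=true commits right away.
function buildExtractUrl(id, coreUrl = "http://localhost:8983/solr/richText") {
  const params = new URLSearchParams({ "literal.id": String(id), commit: "true" });
  return coreUrl + "/update/extract?" + params.toString();
}

// Send the raw PDF bytes to Solr; pdfBytes is a Buffer or Uint8Array supplied by the caller.
async function indexPdf(id, pdfBytes) {
  const res = await fetch(buildExtractUrl(id), {
    method: "POST",
    headers: { "Content-Type": "application/pdf" },
    body: pdfBytes,
  });
  if (!res.ok) throw new Error("Solr returned HTTP " + res.status);
  return res.text();
}

const extractUrl = buildExtractUrl(123);
console.log(extractUrl);
```

Committing after every document is convenient for a quick test but slow for a large batch; for hundreds of thousands of files it is probably better to drop `commit=true` and commit once at the end.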

If you opt for indexing using Java and SolrJ (it was the easiest option for me, even though I hadn't done Java in years), here are some useful resources and examples, which I collected after an extensive search for my own purposes:

https://solr.apache.org/guide/8_5/using-solrj.html#using-solrj

I basically copied what's here: https://lucidworks.com/post/indexing-with-solrj/ and tweaked it for my needs.

Tip: Since I was very rusty with Java, instead of setting up classpaths etc., the quick solution for me was to copy ALL libraries from Solr's solrj folder into my Java project, and possibly some other libraries too. It may be ugly, but it did the job for me.

Upvotes: 1
