Aliya Chudhary

Reputation: 11

How to extract text from a PDF using the PDFParser library?

I am new to Swift and iOS. I am working on a project where I need to extract text from a PDF. I know about the PDFKit framework, but I run into memory issues when I loop through the pages.

For that reason I turned to a library called PDFParser, which solves most of my problem. However, with complex PDFs it does not work well and returns wrong results.

I wrote a simple function that extracts the whole text of a page using the parser:

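extension SimpleDocumentIndexer {
    /// Returns the concatenated text of every text block on the given page,
    /// or an empty string if the page has not been indexed.
    public func extractWholeTextFromPage(pageNumber: Int) -> String {
        guard let pageIndex = pageIndexes[pageNumber] else {
            return ""
        }

        // Blocks are appended in index order with no separator between them.
        var wholeText = ""
        for textBlock in pageIndex.textBlocks {
            wholeText.append(textBlock.chars)
        }
        return wholeText
    }
}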

What I want to achieve is:

  1. Loop through all the pages of the PDF (without crashing from memory pressure), extract the whole text of each page, and check whether any entry from a list of words, lines, or paragraphs occurs in that page's text.
  2. Get the coordinates (CGRect) of each match, or get the text immediately before and after it, i.e. to the left and right of the word, line, or paragraph found on the page.
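To make the "before and after text" part of goal 2 concrete, here is a minimal sketch that works on a plain page string, whichever library produced it. The names `TextMatch`, `findMatches`, and `contextLength` are made up for illustration and are not part of PDFParser or PDFKit; the search is assumed to be case-insensitive.

```swift
import Foundation

/// One occurrence of a search term, with the text on either side of it.
/// (Hypothetical type, used only for this illustration.)
struct TextMatch: Equatable {
    let term: String
    let before: String
    let after: String
}

/// Finds every case-insensitive occurrence of each term in `pageText` and
/// captures up to `contextLength` characters to the left and right of it.
func findMatches(of terms: [String], in pageText: String, contextLength: Int = 20) -> [TextMatch] {
    var matches: [TextMatch] = []
    for term in terms where !term.isEmpty {
        var searchStart = pageText.startIndex
        while let range = pageText.range(of: term,
                                         options: .caseInsensitive,
                                         range: searchStart..<pageText.endIndex) {
            // Clamp the context window to the bounds of the page text.
            let left = pageText.index(range.lowerBound, offsetBy: -contextLength,
                                      limitedBy: pageText.startIndex) ?? pageText.startIndex
            let right = pageText.index(range.upperBound, offsetBy: contextLength,
                                       limitedBy: pageText.endIndex) ?? pageText.endIndex
            matches.append(TextMatch(term: term,
                                     before: String(pageText[left..<range.lowerBound]),
                                     after: String(pageText[range.upperBound..<right])))
            // Continue searching after this occurrence.
            searchStart = range.upperBound
        }
    }
    return matches
}
```

The string returned by `extractWholeTextFromPage(pageNumber:)` could be passed as `pageText`, one page at a time, so only a single page's text is held in memory during the comparison.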

I tried various approaches, including searching the text returned by PDFKit's PDFPage.string, but that causes memory issues. I also tried other parsing libraries.

P.S.: I don't want to use a paid library, because all I want is an offline solution that runs on the user's device. I also know about the PDFDocument.findString method, but I want the more specific approach described above.
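For reference, this is roughly the shape of the PDFKit-only version I have in mind for the coordinates, as a sketch and not a tested solution. The helper names `nsRanges` and `locate` are hypothetical. It assumes that wrapping each page in `autoreleasepool` lets temporary objects be released between pages, which may lower peak memory but is not guaranteed to fix the crash. It uses `PDFPage.selection(for:)` and `PDFSelection.bounds(for:)` to turn a character range into a rectangle in page space.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Returns the UTF-16 ranges of every case-insensitive occurrence of `term`.
/// (Hypothetical helper; NSRange is used because PDFPage.selection(for:) takes one.)
func nsRanges(of term: String, in text: String) -> [NSRange] {
    let haystack = NSString(string: text)
    var ranges: [NSRange] = []
    var searchRange = NSRange(location: 0, length: haystack.length)
    while searchRange.length > 0 {
        let found = haystack.range(of: term, options: .caseInsensitive, range: searchRange)
        if found.location == NSNotFound || found.length == 0 { break }
        ranges.append(found)
        // Resume the search just past this occurrence.
        let next = found.location + found.length
        searchRange = NSRange(location: next, length: haystack.length - next)
    }
    return ranges
}

#if canImport(PDFKit)
/// Walks the document one page at a time and returns, per page index, the
/// bounding boxes (in page space) of every occurrence of `term`.
func locate(term: String, inPDFAt url: URL) -> [Int: [CGRect]] {
    guard let document = PDFDocument(url: url) else { return [:] }
    var hits: [Int: [CGRect]] = [:]
    for index in 0..<document.pageCount {
        // Assumption: draining the pool per page keeps temporaries from piling up.
        autoreleasepool {
            guard let page = document.page(at: index),
                  let text = page.string else { return }
            let rects = nsRanges(of: term, in: text).compactMap { range in
                page.selection(for: range)?.bounds(for: page)
            }
            if !rects.isEmpty { hits[index] = rects }
        }
    }
    return hits
}
#endif
```

Only the rectangles are kept across iterations; the page text and selections are local to each pass through the loop.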

Upvotes: 1

Views: 143

Answers (0)
