Reputation: 303

How to read and extract the text within a pdf?

I can currently retrieve the PDF as a blob, but I'm not sure how to read the text inside it. Any help?

    this.http.get('/assets/img/1.pdf', {responseType: 'blob'}).subscribe(data => {
        console.log(data);
    })

Upvotes: 1

Answers (1)

Thilina Koggalage

Reputation: 1084

I think the best library for this is pdf.js. It requires Web Worker support in the browser, and its API relies heavily on promises.

Note that the extracted text may not keep the formatting it has in the PDF, and the text order can also be off, as you can see in this example. You may need some workarounds, such as collapsing runs of whitespace into single spaces, to make the extracted text readable. OCR (optical character recognition) is another option worth looking at.
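As a rough illustration of the whitespace workaround, something like the following could be applied to the extracted string. `normalizeExtractedText` is a hypothetical helper name, not part of pdf.js, and this is only a sketch that assumes collapsing all whitespace (including line breaks) is acceptable for your use case.

```javascript
// Hypothetical helper (not part of pdf.js): collapse every run of
// whitespace (spaces, tabs, newlines) into a single space and trim the ends.
function normalizeExtractedText(raw) {
    return raw.replace(/\s+/g, " ").trim();
}
```

This fixes spacing only; it will not repair text that comes out in the wrong order.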

This example will give you an idea about how it works.

    // Resolves with the text of one page (pageNum is 1-based).
    // Returning the promise chain directly lets errors reach the caller.
    function getPageText(pageNum, PDFDocumentInstance) {
        return PDFDocumentInstance.getPage(pageNum)
            .then(function (pdfPage) {
                return pdfPage.getTextContent();
            })
            .then(function (textContent) {
                // Each item is one text run; join the runs with spaces.
                return textContent.items
                    .map(function (item) { return item.str; })
                    .join(" ");
            });
    }
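To get the text of the whole document rather than a single page, the same calls can be repeated for each page. The sketch below assumes `pdfDocument` is an already loaded pdf.js document exposing `numPages` and `getPage(n)`; `getAllPagesText` is a name made up for this example.

```javascript
// Sketch: collect the text of every page, in page order.
// Assumes `pdfDocument` is a loaded pdf.js document (pages are 1-based).
async function getAllPagesText(pdfDocument) {
    const pages = [];
    for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
        const page = await pdfDocument.getPage(pageNum);
        const textContent = await page.getTextContent();
        // Join the text runs of this page with single spaces.
        pages.push(textContent.items.map((item) => item.str).join(" "));
    }
    return pages; // one string per page
}
```

With the blob from the question, the document could probably be loaded along the lines of `pdfjsLib.getDocument({ data: await blob.arrayBuffer() }).promise` before calling this function. That assumes a pdf.js build exposed as `pdfjsLib` with its worker already configured.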

You can refer to this article for more detail: How to convert PDF to Text

Upvotes: 3
