ar099968

Reputation: 7577

pdfjs: get raw text from pdf with correct newline/whitespace

Using pdf.js, I have written a simple function to extract the raw text from a PDF:

async function getPdfText(path) {

    const pdf = await PDFJS.getDocument(path);

    const pagePromises = [];
    for (let j = 1; j <= pdf.numPages; j++) {
        const page = pdf.getPage(j);

        pagePromises.push(page.then((page) => {
            const textContent = page.getTextContent();
            return textContent.then((text) => {
                return text.items.map((s) =>  s.str).join('');
            });
        }));
    }

    const texts = await Promise.all(pagePromises);
    return texts.join('');
}

// usage
getPdfText("C:\\my.pdf").then((text) => { console.log(text); });

However, I can't find a way to extract the newlines correctly: all the text comes out on a single line.

How can I extract the text correctly? I want the same result I get on a desktop PC with these steps:

Open the PDF (double-click the file) -> select all text (CTRL + A) -> copy the selected text (CTRL + C) -> paste the copied text (CTRL + V)

Upvotes: 6

Views: 4314

Answers (1)

LinkmanXBP

Reputation: 176

I know the question is more than a year old, but I'm answering in case anyone has the same problem.

As this post says:

In PDF there is no such thing as controlling layout using control chars such as '\n'; glyphs in PDF are positioned using exact coordinates. Use the text y-coordinate (which can be extracted from the transform matrix) to detect a line change.

So with pdf.js, you can use the transform property of each item in textContent.items, specifically index 5 of the transform array, which holds the item's y-coordinate. If this value changes from one item to the next, a new line has started.

Here's my code :

            page.getTextContent().then(function (textContent) {
                var textItems = textContent.items;
                var finalString = "";
                var line = 0;

                // Concatenate the string of the item to the final string
                for (var i = 0; i < textItems.length; i++) {
                    if (line != textItems[i].transform[5]) {
                        if (line != 0) {
                            finalString +='\r\n';
                        }

                        line = textItems[i].transform[5];
                    }
                    var item = textItems[i];

                    finalString += item.str;
                }

                var node = document.getElementById('output');
                node.value = finalString;
            });
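
A possible variation on the same idea, as a rough sketch rather than a definitive implementation: compare the y-coordinates with a small tolerance instead of strict inequality, so that items sitting on almost the same baseline are not split. The function name `itemsToText` and the `tolerance` parameter are my own assumptions, not part of pdf.js; the input is assumed to have the same shape as textContent.items. It also avoids treating a y-coordinate of 0 as a special case.

```javascript
// Sketch: join pdf.js-style text items into lines using the y-coordinate.
// `items` is assumed to look like textContent.items: [{ str, transform }, ...].
// `tolerance` is a hypothetical parameter (in PDF units), not a pdf.js option.
function itemsToText(items, tolerance = 2) {
    const lines = [];
    let lastY = null;

    for (const item of items) {
        const y = item.transform[5]; // vertical position of this item
        if (lastY === null || Math.abs(y - lastY) > tolerance) {
            lines.push(item.str); // y moved noticeably: start a new line
        } else {
            lines[lines.length - 1] += item.str; // same line: append
        }
        lastY = y;
    }
    return lines.join('\n');
}
```

It could be called as `itemsToText(textContent.items)` inside the getTextContent callback. A suitable tolerance depends on the document's font sizes, so the default of 2 is only a guess.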

As weird as it sounds, instead of using transform, you can also use the fontName property. With each new line, the fontName may change, but this depends on how the PDF was produced, so it is less reliable than the y-coordinate.
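
To tie this back to the multi-page function in the question, here is a minimal sketch that applies the y-coordinate check to every page. It assumes `pdf` is an already-loaded pdf.js document (something exposing numPages and getPage(n)); the name `getDocumentText` is hypothetical, and how you obtain `pdf` from getDocument depends on your pdf.js version.

```javascript
// Sketch: collect text from every page of an already-loaded pdf.js document.
// `pdf` is assumed to expose numPages and getPage(n), as in the question's code.
async function getDocumentText(pdf) {
    const pageTexts = [];

    for (let n = 1; n <= pdf.numPages; n++) {
        const page = await pdf.getPage(n);
        const textContent = await page.getTextContent();

        let text = '';
        let lastY = null;
        for (const item of textContent.items) {
            const y = item.transform[5]; // y-coordinate of this item
            if (lastY !== null && y !== lastY) {
                text += '\n'; // y changed, so a new line starts
            }
            text += item.str;
            lastY = y;
        }
        pageTexts.push(text);
    }

    return pageTexts.join('\n'); // keep a line break between pages
}
```

Pages are processed one after another here for readability; the Promise.all approach from the question would work as well.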

Upvotes: 13
