Reputation: 73

Extract the first line of text from PDF

I'm new to C++, but not to programming. I'm trying to find a library that will allow me to extract the text from a PDF, preferably the first line of the PDF. A code example with the library would be appreciated.

The reason I'm trying to do this is to rename several hundred files based on the first line within the PDF (which happens to be the title in each).

Upvotes: 0

Answers (2)

ccxvii

Reputation: 2123

You don't need to resort to C++ to achieve this; the "mutool" command that comes with MuPDF can print the text content of a page. The following command line converts the first page of the PDF to plain text. That conversion comes with many caveats, but with most well-formed PDF files this step should work fine. The output from mutool is then piped through sed to print only the first line.

mutool draw -F text -o - input.pdf 1 | sed 1q

Of course you can also do this using the MuPDF C library, but why waste time coding that when a simple shell script can do the job?

Now you can wrap this up in a script to rename your files. For example:

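for INPUT in source-directory/*.pdf
do
    # First line of text on page 1, assumed to be the title
    OUTPUT=$(mutool draw -F text -o - "$INPUT" 1 | sed 1q)
    # Skip files with no extractable text rather than creating ".pdf"
    [ -n "$OUTPUT" ] || continue
    # Copy instead of moving, so the originals stay untouched
    cp "$INPUT" destination-directory/"$OUTPUT".pdf
done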
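
One caveat with a loop like this: the extracted title goes straight into a file name, so a title containing a slash, or one made up only of whitespace, will cause trouble. Here is a minimal sketch of a guard, assuming a POSIX shell. The names sanitize_title and safe_name are hypothetical helpers for this example, not part of MuPDF.

```shell
#!/bin/sh

# sanitize_title: turn an extracted title into something usable as a file name.
# Hypothetical helper: replaces "/" with "_", drops control characters,
# and trims leading and trailing whitespace.
sanitize_title() {
    printf '%s' "$1" |
        tr '/' '_' |
        tr -d '[:cntrl:]' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# safe_name: sanitized title, or the fallback in $2 when nothing is left.
safe_name() {
    name=$(sanitize_title "$1")
    [ -n "$name" ] || name=$2
    printf '%s\n' "$name"
}
```

Inside the loop, something like OUTPUT=$(safe_name "$OUTPUT" "$(basename "$INPUT" .pdf)") would fall back to the original file name when no usable title is found. Two PDFs with the same title would still overwrite each other; this sketch does not handle that.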

Upvotes: 1

Mark Storer

Reputation: 15870

The challenge here is that PDF is a lot like SVG or PostScript. The order in which you position and display things need not have any relation to their logical/reading order.

As a terribly stilted example, one could draw all the 'a's on a page, then all the 'b's, and so on.

A far less stilted example (one I've seen in actual PDFs) is to draw all the text in a given font at once, then the next font, and so on. This is more challenging than you might think, in that italic text is generally a distinct font, as is bold, as is bold italic. If one is iterating through fonts in hash-table or alphabetical order, it's reasonable to expect that the title will not be the first text drawn by the page contents.
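
To make the drawing-order problem concrete, here is a rough sketch, assuming the text fragments have already been pulled out as lines of "y x text". That format is made up for this example and is not the output of any particular tool. It also assumes the default PDF coordinate system, where a larger y is higher on the page, and that fragments on the same line share exactly the same y, which real files often violate. Some tools report coordinates from the top-left corner instead, in which case the sort direction on y flips.

```shell
#!/bin/sh

# reading_order: sort "y x text" fragments top-to-bottom, then left-to-right.
# Assumes larger y means higher on the page.
reading_order() {
    sort -k1,1nr -k2,2n
}

# first_line: join the text of every fragment that shares the topmost y value.
first_line() {
    reading_order | awk '
        NR == 1 { top = $1 }
        $1 == top {
            $1 = ""; $2 = ""          # drop the coordinates
            sub(/^ +/, "")            # and the separators they leave behind
            line = (line == "" ? $0 : line " " $0)
        }
        END { print line }'
}

# Fragments in drawing order: body text first, title pieces scattered.
printf '%s\n' \
    '700 72 Body text set in the regular font' \
    '750 150 Report' \
    '650 72 More body text' \
    '750 72 Quarterly' | first_line
```

Taking the first line of the raw input would give the body text; sorting by position first gives the title instead.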

Having said all that, this is a solved problem, several times over.

The bad news: None of those solutions appear in the open source libraries linked in that first comment... 'cept maybe MuPDF, but it's not apparent from its online docs that it can.

The good news: There are several command-line applications quite capable of extracting text from a PDF, all of which are described in an excellent answer here on SO: PDF Text Extraction with Coordinates

MuPDF's mutool is listed as one of the options, so it's clearly possible with MuPDF (built by the same company that does Ghostscript).

Upvotes: 1
