andronikus

Reputation: 4210

Print contents of a PDF to the command line

I'm looking for a command-line program that will print out the text of a PDF file, just like cat for a text file.

I've found pdftotext, and that would be workable, but I'd prefer something that behaves like cat and writes to stdout, because I want to pipe the output to grep. Thanks!

Upvotes: 18

Views: 17277

Answers (2)

jsvk

Reputation: 1729

In the man page for pdftotext, I found this:

pdftotext [options] [PDF-file [text-file]]

Description: Pdftotext converts Portable Document Format (PDF) files to plain text.

Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is sent to stdout.

So, to send the text to stdout and pipe it to grep, use this:

pdftotext mydoc.pdf - | grep mysearchterm
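If you want something closer to cat that accepts several files, the command above could be wrapped in a small shell function. This is only a sketch: the name pdfcat is made up for illustration and is not part of pdftotext, and it assumes pdftotext is on your PATH.

```shell
# pdfcat: hypothetical helper (not shipped with pdftotext) that prints the
# text of every PDF given as an argument to stdout, like cat for text files.
pdfcat() {
    rc=0
    for f in "$@"; do
        # "-" as the text-file argument makes pdftotext write to stdout.
        pdftotext "$f" - || rc=1
    done
    # Non-zero if any file failed to convert.
    return "$rc"
}

# Usage:
#   pdfcat mydoc.pdf other.pdf | grep mysearchterm
```

Because the text of all files is concatenated, grep will not tell you which PDF a match came from; for that you would need to run grep once per file.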

Upvotes: 48

luochen1990

Reputation: 3847

Maybe you can try this: https://github.com/luochen1990/nodejs-easy-pdf-parser

It is an npm package, so you need to install Node.js (and npm) to use it.

It can be used as a command line tool:

npm install -g easy-pdf-parser
pdf2text test.pdf > test.txt

This tool sorts text lines by their y coordinates, so it works well in most cases. It also handles Unicode correctly and is cross-platform (by comparison, mingw64's pdftotext loses Unicode characters on Windows).

Upvotes: 2
