Reputation: 2728
I have followed ideas from this thread but it does not work. https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
Upvotes: 1
Views: 490
Reputation: 90213
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext
, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com
is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext
first, streaming the output to stdout.
There are many types of PDF where pdftotext
cannot extract the text at all. The reasons for this may be (listings below not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications, which do convert fonts to so-called 'outlines'. The reason for this seemingly strange behaviour may be:
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode
table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode
table. Programs like pdftotext
require a toUnicode
table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts
to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
In this case, text extraction (and your method of grepping for strings) should work:
uni
(telling if a toUnicode
map is embedded in the PDF file)
says no
for each single font, the encoding
column does not contain custom
, but builtin
(meaning that a glyph->character mapping is provided with the font file, which is of type Type 1
.To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!
Upvotes: 6