MikiBelavista

Reputation: 2728

How to search my PDF with grep?

I have followed ideas from this thread but it does not work. https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files

 pdftotext PercivalWalden.pdf - | grep 'Slepian'
 pdftotext PercivalWalden.pdf - | grep 'Naive'
 pdftotext PercivalWalden.pdf - | grep 'Filter'

I know for sure that 'Filter' appears at least 100 times in this book.

Any ideas?

Upvotes: 1

Views: 490

Answers (1)

Kurt Pfeifle

Reputation: 90213

If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.

First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.

In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.

There are many types of PDF from which pdftotext cannot extract the text at all. Possible reasons include (the list below is not complete):

  1. The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.

  2. The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images) that only look like text strings to our eyes and brain.

    Many software applications convert fonts to so-called 'outlines'. The reasons for this seemingly strange behaviour may be:

    • Circumvent licensing problems (when a certain font disallows its embedding).
    • Impose a handicap upon attempts to extract the text.
    • Accidentally wrong setting in the PDF generating application.
       
  3. The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.

    'Glyphs' are the well-defined shapes that each font draws on screen. For the computer, glyphs have to be mapped to characters explicitly; our eyes merely see the shapes, and our brains translate them into characters without needing a toUnicode table. Programs like pdftotext require such a table to translate glyphs back into characters.
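A quick first check for cases 1 and 2 above is whether pdftotext produces any real text at all. The following is only a minimal sketch: count_text_chars is a hypothetical helper name (not part of any tool), and it assumes that an image-only or outlined PDF yields little more than whitespace and page-break characters. A count of 0 hints at case 1 or 2; a large count alongside failing greps hints more at case 3 (extracted characters that are garbage).

```shell
# Hypothetical helper: count the non-whitespace characters arriving on stdin.
# Form feeds (which pdftotext emits between pages) count as whitespace here.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Usage (assumes pdftotext is installed and the file name from the question):
#   pdftotext PercivalWalden.pdf - | count_text_chars
```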


You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:

 pdffonts paper-projectiris---final.pdf

 name                       type         encoding       emb sub uni object ID
 -------------------------- ------------ -------------- --- --- --- ---------
 TCQJEF+CMCSC10             Type 1       Builtin        yes yes no      96  0
 VPAFLY+CMBX12              Type 1       Builtin        yes yes no      97  0
 CWAIXW+CMTI12              Type 1       Builtin        yes yes no      98  0
 OBMDLT+CMR12               Type 1       Builtin        yes yes no      99  0

In this case, text extraction (and your method of grepping for strings) should work:

  • Even though the column named uni (which tells whether a toUnicode map is embedded in the PDF file) says no for every single font, the encoding column does not contain Custom but Builtin. This means that a glyph->character mapping is provided with the font file itself, which is of Type 1.
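If you want to apply the same reading to a PDF with many fonts, something like the following could help. It is a rough sketch, not a definitive test: flag_risky_fonts is a hypothetical helper name, and it assumes the column layout shown above (two header lines, then one font per line, ending in emb, sub, uni and a two-number object ID). Fields are counted from the end of each line because the type column can itself contain spaces, as in "Type 1".

```shell
# Hypothetical helper: read pdffonts output on stdin and print the names of
# fonts whose encoding is Custom and whose uni column says no.
# Assumes the first two lines are the header and the dashed separator.
flag_risky_fonts() {
    awk 'NR > 2 && NF >= 7 &&
         tolower($(NF-5)) == "custom" && $(NF-2) == "no" { print $1 }'
}

# Usage (assumes pdffonts is installed and the file name from the question):
#   pdffonts PercivalWalden.pdf | flag_risky_fonts
```

For the example output above this prints nothing, since every font uses the Builtin encoding. Any font it does print is a likely candidate for the problem described in point 3.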

To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!

Upvotes: 6
