How to search a PDF (1.4) byte array for a target string?

Question

I know this is probably a bit unusual, but I'd like to find out if a PDF document (a byte array) contains a particular piece of text. I create the docs myself in Java using the iText library v2.1.7, which produces docs compliant with the PDF 1.4 spec.

My initial naive attempt was something like this:

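import com.google.common.primitives.Bytes; // Guava
import java.nio.charset.StandardCharsets;

byte[] target = "the target text".getBytes(StandardCharsets.UTF_8);
int index = Bytes.indexOf(pdfBytes, target); // naive byte-level search
System.out.println(index); // always -1 (not found)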

I just don't understand enough about how these types of documents are encoded to figure out what I need to do. I suppose what I really need to find out is what kind of encoding I need to use on the target text when I'm converting to bytes, so that it will match what the PDF uses.

I created a small sample PDF document which contains nothing except a phrase with the words "one two three four five". This is what the contents of that PDF file look like if I cat the file in a Linux terminal (or use vim to view it):

%PDF-1.4
%����
2 0 obj
<>stream
x�+�r
�24U�02I�2P0Q�n�
�F
!i\�y�
%��
%E��
i��E
i�e��!Y0Ů!\�\���
endstream
endobj
4 0 obj
<>>>/Parent 3 0 R/MediaBox[0 0 595 842]>>
endobj
1 0 obj
<>
endobj
3 0 obj
<>
endobj
5 0 obj
<>
endobj
6 0 obj
<>
endobj
xref
0 7
0000000000 65535 f 
0000000309 00000 n 
0000000015 00000 n 
0000000397 00000 n 
0000000152 00000 n 
0000000460 00000 n 
0000000505 00000 n 
trailer
<<7bf1bdf9e8d048c5795c7785954d9360>]/Root 5 0 R/Size 7>>
startxref
615
%%EOF

Some of those character encodings have not translated properly in the copy-and-paste, so if you copy and save what you see there, you'll get a corrupt PDF. Here's a link to a copy of that PDF.

I've tried encoding my target string to various encodings such as CP-1252 and WinAnsiEncoding, but these are unrecognized character sets.

I didn't think this would cause me much trouble initially, but I haven't yet been able to figure out how to do this. I do have a workaround that gets me the same result, but it's a solution that's specifically intended for the iText library i.e. not a general solution for searching text in a PDF byte array.

If I use iText to parse the byte array that I want to search, I can iterate over each page of the PDF and extract the text:

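import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

// Returns true if the extracted text of any single page contains the given text.
private static boolean doesPDFContain(byte[] pdf, String text) throws Exception {
    PdfReader reader = new PdfReader(pdf);
    try {
        int numPages = reader.getNumberOfPages();
        PdfTextExtractor extractor = new PdfTextExtractor(reader);

        for (int i = 1; i <= numPages; i++) { // page numbers are 1-based
            if (extractor.getTextFromPage(i).contains(text)) {
                return true;
            }
        }
        return false;
    } finally {
        reader.close();
    }
}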

I'd still be interested in hearing if it's possible to do what I was originally attempting.

mkl · Accepted Answer

There are a number of reasons why your naive approach (simply looking for the text in a specific encoding) won't work in general.

The text you are looking for, text displayed on screen, is drawn by text drawing instructions in some content stream. (Let's ignore the cases of graphics looking like text but being drawn using vector or bitmap graphics commands and of missing or inaccurate font encoding information.)

- The text you are looking for is not necessarily drawn by a single instruction. The text "Hello", for example, might be written using two consecutive commands:
  ```
  (Hel) Tj (lo) Tj
  ```
  The different commands need not even follow each other in the content stream; they may be spread across it.
- Each font in a PDF can use a different encoding for its strings, and these encodings need not be standard encodings; they may be ad-hoc encodings created on the fly by the PDF creator program.
- The content stream can (and usually does) require a filter for decoding, e.g. in the PDF above the content stream in object 2 requires FlateDecode filtering (essentially: unzipping).
- The PDF may be encrypted (in which case, more specifically, strings and streams are encrypted); even PDFs you can open without any further ado in your PDF viewer may be encrypted using a default password (this technique is used for encoding permissions).
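
To illustrate the filter point in isolation, here is a minimal sketch using only the JDK. It is not taken from iText; the class name `FlateSearchDemo`, the helper methods, and the sample content stream are made up for this illustration. It compresses a hand-written content stream the way a FlateDecode stream body is stored and shows that a naive byte search only succeeds after inflating.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; not part of iText or any PDF library.
class FlateSearchDemo {

    /** Compresses data in zlib/deflate format, as used for FlateDecode stream bodies. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reverses deflate(); this is what "applying the FlateDecode filter" amounts to. */
    static byte[] inflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input; stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Naive byte search, similar in spirit to Guava's Bytes.indexOf. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up content stream: two lines of text drawn with the Tj operator.
        byte[] content = ("BT /F1 12 Tf 36 806 Td (one two three four five) Tj "
                + "0 -16 Td (one two three four five) Tj ET")
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] target = "one two three".getBytes(StandardCharsets.ISO_8859_1);

        byte[] stored = deflate(content); // roughly what sits between "stream" and "endstream"
        System.out.println(indexOf(stored, target));          // search in the compressed bytes
        System.out.println(indexOf(inflate(stored), target)); // search after decoding
    }
}
```

The first search misses because the compressed bytes no longer contain the plain characters; the second one hits only because this sample uses a single Tj operand in a simple single-byte encoding, which the other points in the list above do not guarantee for real files.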

Thus, to inspect the contents of content streams, you may have to

1. decrypt the file; then
2. decode the content streams with their respectively applicable filters; then
3. parse the content stream instructions to know for each text drawing instruction
   - which font is used to draw the text and
   - at which position the text is drawn; then
4. decode the string contents according to the information in the font; then
5. sort the text pieces according to the position information and put them together as a single string.

Only in the resulting character string can you finally search for the text in a naive way.
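
As a rough sketch of the last steps for the simplest possible case, the following JDK-only code collects the literal string operands of an already decrypted and decoded content stream and joins them in stream order. The class and method names are hypothetical, and the assumptions are strong: the font uses a plain single-byte encoding that matches ISO-8859-1, the strings appear in reading order, and only `( ... )` literal strings occur (no hex strings, no font lookup, no position sorting). A real PDF generally needs a PDF library for this.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not a general PDF text extractor.
class NaiveTextSketch {

    /**
     * Concatenates all literal strings "( ... )" found in a decoded content stream.
     * Handles balanced nested parentheses and the escapes \( \) \\ only; other
     * escape sequences (such as \n or octal codes) are not interpreted correctly.
     */
    static String literalStrings(byte[] decodedContent) {
        String s = new String(decodedContent, StandardCharsets.ISO_8859_1);
        StringBuilder text = new StringBuilder();
        int depth = 0; // 0 = outside any literal string
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (depth == 0) {
                if (c == '(') {
                    depth = 1; // a literal string starts here
                }
                continue;      // operators and numbers outside strings are ignored
            }
            if (c == '\\' && i + 1 < s.length()) {
                char escaped = s.charAt(++i);
                if (escaped == '(' || escaped == ')' || escaped == '\\') {
                    text.append(escaped);
                }
            } else if (c == '(') {
                depth++;
                text.append(c);
            } else if (c == ')') {
                if (--depth > 0) {
                    text.append(c); // closing a nested parenthesis, still inside the string
                }
            } else {
                text.append(c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "Hello" split across two Tj instructions, as in the example above.
        byte[] content = "BT /F1 12 Tf 36 806 Td (Hel) Tj (lo) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(literalStrings(content).contains("Hello"));
    }
}
```

Under these assumptions the split "Hello" becomes searchable again; as soon as a font uses an ad-hoc encoding or the pieces are not in reading order, this shortcut breaks down and the full procedure described above is needed.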
