Reputation: 6524
I know this is probably a bit unusual, but I'd like to find out if a PDF document (a byte array) contains a particular piece of text. I create the docs myself in Java using the iText library v2.1.7, which produces docs compliant with the PDF 1.4 spec.
My initial naive attempt was something like this:
byte[] target = "the target text".getBytes("UTF-8");
int index = Bytes.indexOf(pdfBytes, target); // Guava lib
System.out.println( index ); // always -1 (not found)
I just don't understand enough about how these types of documents are encoded to figure out what I need to do. I suppose what I really need to find out is what kind of encoding I need to use on the target text when I'm converting to bytes, so that it will match what the PDF uses.
I created a small sample PDF document which contains nothing except a phrase with the words one two three four five. This is what the contents of that PDF file look like if I cat the file in a Linux terminal (or use vim to view it):
%PDF-1.4
%����
2 0 obj
<</Filter/FlateDecode/Length 71>>stream
x�+�r
�24U�02I�2P0Q�n�
�F
!i\�y�
%��
%E��
i��E
i�e��!Y0Ů!\�\���
endstream
endobj
4 0 obj
<</Contents 2 0 R/Type/Page/Resources<</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]/Font<</F1 1 0 R>>>>/Parent 3 0 R/MediaBox[0 0 595 842]>>
endobj
1 0 obj
<</Subtype/Type1/Type/Font/BaseFont/Helvetica/Encoding/WinAnsiEncoding>>
endobj
3 0 obj
<</Kids[4 0 R]/Type/Pages/Count 1/ITXT(2.1.7)>>
endobj
5 0 obj
<</Type/Catalog/Pages 3 0 R>>
endobj
6 0 obj
<</ModDate(D:20171216101023Z)/CreationDate(D:20171216101023Z)/Producer(iText 2.1.7 by 1T3XT)>>
endobj
xref
0 7
0000000000 65535 f
0000000309 00000 n
0000000015 00000 n
0000000397 00000 n
0000000152 00000 n
0000000460 00000 n
0000000505 00000 n
trailer
<</Info 6 0 R/ID [<9e1d205d229e3d1b5b56354a7da26844><7bf1bdf9e8d048c5795c7785954d9360>]/Root 5 0 R/Size 7>>
startxref
615
%%EOF
Some of those character encodings have not translated properly in the copy-and-paste, so if you copy and save what you see there, you'll get a corrupt PDF. Here's a link to a copy of that PDF.
I've tried encoding my target string with various encodings such as CP-1252 and WinAnsiEncoding, but Java does not recognize either of these names as a character set.
I didn't think this would cause me much trouble initially, but I haven't yet been able to figure out how to do this. I do have a workaround that gets me the same result, but it's a solution that's specifically intended for the iText library i.e. not a general solution for searching text in a PDF byte array.
If I use iText to parse the byte array that I want to search, I can iterate over each page of the PDF and extract the text:
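import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

// Returns true if the text extracted from any single page contains the given text.
private static boolean doesPDFContain(byte[] pdf, String text) throws Exception {
    PdfReader reader = new PdfReader(pdf);
    int numPages = reader.getNumberOfPages();
    PdfTextExtractor extractor = new PdfTextExtractor(reader);
    for (int i = 1; i <= numPages; i++) { // page numbers are 1-based
        if (extractor.getTextFromPage(i).contains(text)) {
            return true;
        }
    }
    return false;
}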
I'd still be interested in hearing if it's possible to do what I was originally attempting.
Upvotes: 4
Views: 3298
Reputation: 96064
There are a number of reasons why your naive approach, simply looking for the text in a specific encoding, won't work in general.
The text you are looking for, text displayed on screen, is drawn by text drawing instructions in some content stream. (Let's ignore the cases of graphics looking like text but being drawn using vector or bitmap graphics commands and of missing or inaccurate font encoding information.)
The text you are looking for is not necessarily drawn by a single instruction. The text "Hello", for example, might be written using two consecutive commands:
(Hel) Tj (lo) Tj
The different commands need not even follow each other in the content stream; they may be spread across it.
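To illustrate the splitting, here is a minimal sketch that collects the literal string operands from already decoded content stream text and joins them. The class and method names are made up for this example, and it makes simplifying assumptions: it ignores hex strings like <48656C6C6F>, keeps the character after a backslash as-is instead of interpreting escapes such as \n or octal codes, and joins fragments in stream order without looking at their position on the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText or any other library.
class ContentStreamStrings {

    /** Collects literal string operands such as (Hel) from decoded content stream text. */
    static List<String> literalStrings(String content) {
        List<String> result = new ArrayList<>();
        StringBuilder current = null; // non-null while inside a literal string
        int depth = 0;                // PDF allows balanced, unescaped parentheses
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (current == null) {
                if (c == '(') {
                    current = new StringBuilder();
                    depth = 1;
                }
                continue; // operators and numbers outside strings are skipped
            }
            if (c == '\\' && i + 1 < content.length()) {
                current.append(content.charAt(++i)); // simplification: keep escaped char literally
            } else if (c == '(') {
                depth++;
                current.append(c);
            } else if (c == ')') {
                if (--depth == 0) {
                    result.add(current.toString());
                    current = null;
                } else {
                    current.append(c);
                }
            } else {
                current.append(c);
            }
        }
        return result;
    }

    /** Joins all fragments in the order they appear in the stream. */
    static String joinedText(String content) {
        return String.join("", literalStrings(content));
    }

    public static void main(String[] args) {
        System.out.println(joinedText("(Hel) Tj (lo) Tj")); // prints "Hello"
    }
}
```

Only after joining the fragments does a plain contains("Hello") have a chance of matching.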
Each font in a PDF can use a different encoding for its strings, and these encodings don't even need to be standard encodings; they may be ad-hoc encodings created on the fly by the PDF creator program.
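For the font in the sample above, which declares WinAnsiEncoding, the closest Java charset is windows-1252 (also known as Cp1252). The two are close but not identical, and this charset ships with the usual JDKs without being guaranteed by the Java specification, so treat the following as a sketch under those assumptions. The class name and the content stream fragment are invented for illustration. It shows why the bytes of a UTF-8 encoded target need not match the bytes of a string operand:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of iText or any other library.
class WinAnsiSketch {

    // Assumption: windows-1252 is a good enough stand-in for WinAnsiEncoding.
    static final Charset CP1252 = Charset.forName("windows-1252");

    static byte[] encode(String text) {
        return text.getBytes(CP1252);
    }

    static String decode(byte[] stringOperand) {
        return new String(stringOperand, CP1252);
    }

    /** Plain byte search, comparable to Guava's Bytes.indexOf. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "cafe" with an acute accent on the e
        // Invented, uncompressed content stream fragment in the font's encoding.
        byte[] stream = encode("(" + word + ") Tj");

        // UTF-8 uses two bytes for the accented letter, windows-1252 one.
        System.out.println(indexOf(stream, word.getBytes(StandardCharsets.UTF_8))); // -1
        System.out.println(indexOf(stream, encode(word)));                          // 1
    }
}
```

For fonts with a custom or subset encoding there is no matching Java charset at all; the mapping has to be read from the font's own encoding information in the PDF.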
The content stream can (and usually does) require a filter for decoding, e.g. in the PDF above the content stream in object 2 requires FlateDecode filtering (essentially: unzipping).
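A minimal sketch of that unzipping step, using only java.util.zip from the JDK. FlateDecode data is in zlib format, which is what a default Inflater expects. The class name is made up, and the content stream text is invented for illustration; it is not the actual content of object 2. The input to inflate must be exactly the bytes between the stream and endstream keywords.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of iText or any other library.
class FlateSketch {

    /** Undoes FlateDecode: inflates zlib-wrapped deflate data. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: return what could be decoded so far
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Only used to produce test data, mimicking what a PDF writer does. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Invented content stream, for illustration only.
        String content = "BT /F1 12 Tf (one two three four five) Tj ET";
        byte[] stored = deflate(content.getBytes(StandardCharsets.ISO_8859_1));
        String decoded = new String(inflate(stored), StandardCharsets.ISO_8859_1);
        System.out.println(decoded.contains("one two three four five")); // true
    }
}
```

Streams can also declare other filters or a chain of several filters, which this sketch does not handle.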
The PDF may be encrypted (in which case more specifically strings and streams are encrypted); even PDFs you can open without any further ado in your PDF viewer may be encrypted using a default password (this technique is used for encoding permissions).
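Thus, to inspect the contents of content streams, you may have to decrypt the PDF if it is encrypted, decode each content stream with the filters it declares, parse the decoded stream for text drawing instructions, decode their string arguments according to the encoding of the font in use, and join the resulting fragments into one character string.
In this character string you can eventually search the text in a naive way.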
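For a very simple, unencrypted file like the sample above, the whole chain can be approximated with JDK classes alone. The following is a rough sketch, not a general solution: the class name is invented, streams are located by keyword rather than by parsing objects and /Length, only FlateDecode is tried, literal strings with escapes or nested parentheses are skipped, hex strings and font encodings are ignored (so the target should be plain ASCII), and fragments are joined in stream order. The samplePdf method builds just a PDF-like fragment for testing, not a valid document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of iText or any other library.
class NaivePdfTextSearch {

    // ISO-8859-1 maps each byte to exactly one char, so binary data survives the round trip.
    private static final Pattern STREAM =
            Pattern.compile("stream\\r?\\n(.*?)endstream", Pattern.DOTALL);
    // Literal strings without escapes or nested parentheses (deliberate simplification).
    private static final Pattern LITERAL = Pattern.compile("\\(([^()\\\\]*)\\)");

    static boolean containsText(byte[] pdf, String target) {
        String raw = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher streams = STREAM.matcher(raw);
        while (streams.find()) {
            byte[] data = streams.group(1).getBytes(StandardCharsets.ISO_8859_1);
            String content = new String(inflateOrRaw(data), StandardCharsets.ISO_8859_1);
            StringBuilder text = new StringBuilder();
            Matcher strings = LITERAL.matcher(content);
            while (strings.find()) {
                text.append(strings.group(1)); // join fragments in stream order
            }
            if (text.indexOf(target) >= 0) {
                return true;
            }
        }
        return false;
    }

    /** Tries FlateDecode; falls back to the raw bytes if the data is not zlib data. */
    static byte[] inflateOrRaw(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            return data;
        }
    }

    /** Builds a PDF-like fragment with one compressed stream, for testing only. */
    static byte[] samplePdf(String contentStream) {
        try {
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (DeflaterOutputStream z = new DeflaterOutputStream(compressed)) {
                z.write(contentStream.getBytes(StandardCharsets.ISO_8859_1));
            }
            ByteArrayOutputStream pdf = new ByteArrayOutputStream();
            pdf.write(("%PDF-1.4\n2 0 obj\n<</Filter/FlateDecode/Length "
                    + compressed.size() + ">>stream\n").getBytes(StandardCharsets.ISO_8859_1));
            pdf.write(compressed.toByteArray());
            pdf.write("\nendstream\nendobj\n".getBytes(StandardCharsets.ISO_8859_1));
            return pdf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = samplePdf("BT /F1 12 Tf (one two ) Tj (three four five) Tj ET");
        System.out.println(containsText(pdf, "one two three four five")); // true
    }
}
```

Anything beyond such trivial files (encryption, object streams, custom font encodings, text positioned out of order) is where a PDF library's text extraction earns its keep.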
Upvotes: 5