Text search and replace in PDF

Question

I'm using an HTML to PDF conversion tool called Docraptor to produce PDFs of reports users can generate from a site I work on. This conversion takes a little bit of time, however. So, I cache the PDFs I generate and use a hex digest of the generated HTML as the cache key.

Recently we decided to add the current date and time to every report. This, of course, means that the HTML (and, thus, the hex digest) changes on every generation of the report even if the content of the report stays the same, and that we'll always have to generate another PDF.

I figured though that, when generating the HTML, I could just put a token in the place of the current date and time (e.g., '__CURRENT_DATE_TIME__') and that I could do a text search and replace within the cached PDF. Unfortunately, the PDF seems to be using an encoding that makes this a little more complex. Here's an example of the encoded text:

\x947+\xbf\xad|H\xf9c\xe5\xcf\x95\xa7\x941\xd5\x1d\xaa\x07US\xaa\xb7\xd4	\xea
u\xbbz\xad\xfa&\xf5w\xd5\x87\xd5\xbfS\xc74\xb9\xe8Om\xc8}\xfe\xbc4w\x07\xb9T\xe1\xe5\xf6\x90\x187\x85r\xff\x90\x1b\xe3\xff\x8d\xfb:y\xe2\xbcL\xb8\x1b9\xe8\x85\x8d\xdc\x14\xff\x0c\xf7\xcd\xab\xf6\xf0o\xf0\xdf\xe3\xae\x05P4\xb3\xe9E\x98\xc5^\x82\x1f\xc0K\xca_+\x92\x95o\xc1\x8b\:\xbc\x87\xf9\xf0\xeb\xbc\x9f\xfb!w/g%\x15\xfcB\xc5
\x8a\x970\xebL \x9f\x0fq\'85\xb7\x0f1\xfe\x84\xd6\xd8\x08\x17\x934\xf8\x8bb\x1d\xbc\x8f\xfa?\xa2\xdc\x8d:]\xcc\x1d\'Op/p\x17\xa1\'\x1f\x83\x87\xb9\xc3p/\xec\x85\x00\xa9D\xeez\xe1I\xf8\x18\xbeF\x0e\xf1"9\x88~\xb7\x03\x8e\xc2\xbbp\xf2\x1c\xb7
\xef\xd9F\xae^e\xe5\xb6\xaaj\xd0B\x87\xc8\xaa\xd8\x8b\~\xecO\x18\xf5\xbf\'7\xc0\xeb\xfc\xc7\xe8\xfb\xeb\xc8
\xe2\x85G\xe1M\xb4\xfao\x88\x8f\xd8\x153
\x1b\xbc\x8c\x99/\x0b\xeeG\xaf\xfd#Lb\x0c\xfe\\x91\x8d\x11\xf4\x11\x1c\xe2}\xb0^q\x12m\xee=\xfb\xb3\x99f\xe5\x18\x7f\x1d9\xc35\xa09SY\xe6^I\xb31\xe6\xe0{0W\xd1<\x9a\x08\xfb\xd0\x130\x8b\xb0\x88\xfe\x13\xfc\x828Q\x8b\xbfV\xbd\x06\xf7\xc1\xed\xf04\x9f\x0c9\xfc#\xdcN.\xc6\xffT!\xc2\xbf\xc0I~9\xeez5\xe6\xa7\x0c\xe2CJC0\x80r\x88\xb1?\xcc<\x8c\x14.\x87*\xa8"\x9b\xc8zh\xc6\x99\xa5\x90\x15\x1bB\xce\x1f\xc5\$\xc56\xc4\xeeUv*=\xf0K\xb2\x9c$\xc3\xb3\x98\xbd\xac\xa8\xc5\xbb\x94\xda\x99\xd3\x88y\x00\xe3\xf0uXJn\x86\xc9\x99^\x98\xc6s\xc5JrH)z\xd3i\xe5V\xe5\x1e\xe5\xe3\xca\x03\xca\x1f*\x7f\xa1Z\x00W`\xd4\xde\x8fV|\x03>\xc4SC$=\xa8\x8bw\xe0o\xe8\xeb\x8d\x18=\x85\x18?
\xc8\xc5R<\xc3\x06\xb9N\xfe\x19h"\xe90\x8290\x0f\xf3v#\xea`=Z2\x8cT\xae\x85[0\x9e\x1e\xc13\xe4\x97\xf0\x01\x11\xc8\x06\xf8!\x1c\xc3\xc8I\xc58\xef\xc1\xfd5H\xa7\x15.F\xab\x87\xe1Q\xcc\x8e\xd7\x91I\x1c\xe9\x85,(@=}L\x12I\x157\x86\xfb\xd1<{\x17\xe6\xd9i\xe4\xe9w\xf0\x07\xcc\x1c1\xc6W!YH\x9a\xd1z=\xf07\x1a\xcb\xb8C\x05\xb4\x93\xfd\xb08v\x10=a\x054\xf3/\xc1\x7fB6\x9e\xae\x8d\x18\xa3\x0f\xe3\xban\xf4\x8dD\xc8\x84j\xe5\x9b\x84\x83\xc2\x99\x15\xb1*n\x80\x7f\x86\xa4\xe0i\x98\x88^\xb5\x16O\xf6Ed\x14\xb90\xa2\x1cg!\x99\xac\x84\xf2\x99%P\x8dg\xecNhW>"IR}\xdd\xa2\xda\x855\xd5U\x95\xe5\xbe\xb2\xd2\x05%\xde\xe2\xa2BOA~^

First, is it reasonable to expect that '__CURRENT_DATE_TIME__' (properly encoded) can be found in there somewhere? If yes, how would I go about encoding that string so that I could just do a simple search and replace?

mkl · Accepted Answer

"I could just put a token in the place of the current date and time (e.g., '__CURRENT_DATE_TIME__') and that I could do a text search and replace within the cached PDF"

Just to give you an idea why this most likely won't work:

In PDFs, the page content streams (and other streams, too) are most often stored in a deflate-compressed form. Thus, a normal grep or any comparable text search applied to the file has no chance of finding your placeholder.
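To illustrate the compression point, here is a minimal Python sketch. The content stream is a made-up, simplified one (not taken from your file), and zlib stands in for what a PDF writer does when it applies the FlateDecode filter:

```python
import zlib

PLACEHOLDER = b"__CURRENT_DATE_TIME__"

# Hypothetical, simplified page content stream that draws the placeholder
# followed by a few ordinary report lines.
content = b"BT /F1 12 Tf 72 720 Td\n"
content += b"(" + PLACEHOLDER + b") Tj\n"
for i in range(20):
    content += b"0 -14 Td (Report line %d) Tj\n" % i
content += b"ET"

# FlateDecode streams are stored zlib/deflate-compressed.
compressed = zlib.compress(content)

found_in_raw = PLACEHOLDER in content                      # plain stream
found_in_compressed = PLACEHOLDER in compressed            # what grep sees
found_after_inflate = PLACEHOLDER in zlib.decompress(compressed)

print(found_in_raw, found_in_compressed, found_after_inflate)
```

The placeholder only becomes searchable again after the stream has been inflated, so a byte-level replace on the file as stored on disk does not see it.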

Even if you configure your PDF generating software not to compress content streams, you are most likely still in trouble, because:
The encoding of strings in the page content is not necessarily a standard ASCII-like encoding. Especially with partially embedded (subset) fonts, you fairly often see a custom encoding in which the first glyph from that font used in the document is encoded as 0, the second one as 1, and so on. Such a custom encoding obviously breaks your text replacement approach.
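A rough sketch of such a first-use encoding, in Python. The function subset_encode and the sample texts are made up for illustration; real PDF producers differ in the details (for example, they often use two-byte glyph codes):

```python
PLACEHOLDER = "__CURRENT_DATE_TIME__"


def subset_encode(text):
    """Assign codes 0, 1, 2, ... to characters in order of first use."""
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for)
        codes.append(code_for[ch])
    return bytes(codes), code_for


# The same placeholder on two hypothetical pages with different leading text.
encoded_a, table_a = subset_encode("Report generated " + PLACEHOLDER)
encoded_b, table_b = subset_encode("Date: " + PLACEHOLDER)

# The ASCII bytes of the placeholder appear in neither encoded string.
ascii_found = PLACEHOLDER.encode("ascii") in encoded_a

# The code for "_" depends on what was drawn before it on each page.
underscore_codes = (table_a["_"], table_b["_"])
```

Because the codes depend on which characters happened to be used first, the byte sequence to search for would differ from document to document, and you would need the font's encoding (or its ToUnicode map, if present) to work it out.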

Even if you are in a situation where only standard encodings are used, e.g. WinAnsiEncoding, you are still likely in trouble, because:
The text drawing operations in the page content need not be in reading order. E.g. your sample placeholder may be drawn in three pieces, first TIME, then DATE, then CURRENT. This prevents you from recognizing the placeholder.

Even if that does not happen in your case, you may still be in trouble, because:
Even if the parts of your placeholder are drawn in the right order, they may be drawn as separate chunks with numbers in between denoting kerning information, i.e. small position adjustments that account for some letter combinations looking better when not printed at the standard distance. This information again breaks your text replacement approach.
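For illustration, this is roughly what such a kerned chunk can look like in an uncompressed content stream, together with a deliberately naive Python sketch that glues the pieces back together. The stream fragment and the helper tj_text are made up; the regular expressions ignore escaped characters, nested parentheses and hex strings, all of which a real PDF parser has to handle:

```python
import re

PLACEHOLDER = b"__CURRENT_DATE_TIME__"

# Hypothetical content stream fragment: one TJ array holding three string
# pieces with kerning adjustments (30 and 80) between them.
stream = b"BT /F1 12 Tf [(__CURRENT_D) 30 (A) 80 (TE_TIME__)] TJ ET"


def tj_text(data):
    """Join the literal strings of each TJ array, dropping the numbers."""
    texts = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", data, flags=re.S):
        pieces = re.findall(rb"\(([^()\\]*)\)", array)
        texts.append(b"".join(pieces))
    return texts


plain_search_hits = PLACEHOLDER in stream   # the placeholder is split up
joined = tj_text(stream)                    # pieces glued back together
```

Finding the placeholder this way is only half of the job: replacing it means rewriting the whole TJ array, and the kerning numbers chosen for the old text no longer fit the new one.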

If the document in question neither supplies this kerning information nor uses any of the other options mentioned above, your placeholder is quite likely drawn as one text chunk and can be found by a text search.

You may still be in for a surprise, though: if your edit changes the length of the content, you also have to update the cross-reference information in the PDF, because many objects in a PDF are referenced by their byte offset from the start of the document.
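The offset problem can be seen in a small Python sketch. The byte string below is only a PDF-like toy (no real cross-reference table or trailer), and the timestamp is a made-up value:

```python
PLACEHOLDER = b"__CURRENT_DATE_TIME__"

# Toy document: the placeholder lives in object 1, object 2 follows it.
pdf = (
    b"%PDF-1.4\n"
    b"1 0 obj\n(" + PLACEHOLDER + b")\nendobj\n"
    b"2 0 obj\n(unchanged)\nendobj\n"
)
offset_before = pdf.index(b"2 0 obj")

stamp = b"2013-05-01 12:00"  # hypothetical timestamp, shorter than the token

# Naive replace: everything after the edit moves, so stored offsets go stale.
shifted = pdf.replace(PLACEHOLDER, stamp)
offset_shifted = shifted.index(b"2 0 obj")

# Padding the replacement to the placeholder's length keeps offsets intact.
padded = pdf.replace(PLACEHOLDER, stamp.ljust(len(PLACEHOLDER)))
offset_padded = padded.index(b"2 0 obj")

print(offset_before, offset_shifted, offset_padded)
```

Padding with spaces is only a workaround under the assumptions above (uncompressed stream, standard encoding, placeholder drawn as a single chunk, replacement no longer than the placeholder). Otherwise the cross-reference entries and any affected stream /Length values have to be rewritten.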
