Reputation: 11188
I have two versions of a PDF and I know they're slightly different—the "Reassessment" text in the gray bar, on Page 3:
I'm trying to get the textual difference on my machine.
I used pdfcpu to extract the content from the multi-page PDF and then ran page 3 through the diff utility:
% diff out_orig/page_3.txt out_new/page_3.txt
1650a1651,1658
> BT
> 1 0 0 rg
> 0 i
> /RelativeColorimetric ri
> /C2_2 9.96 Tf
> 0 Tw 358.147 648.779 Td
> <0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056>Tj
> ET
I've looked up 7.3.4.3 Hexadecimal String in the PDF reference:
A hexadecimal string shall be written as a sequence of hexadecimal digits encoded as ASCII characters and enclosed within angle brackets.
and so I thought I should be able to do something as simple as interpreting the hex characters directly as ASCII text:
>>> s = '0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056'
>>> import binascii
>>> binascii.a2b_hex(s)
b'\x005\x00H\x00D\x00V\x00V\x00H\x00V\x00V\x00P\x00H\x00Q\x00W\x00\x03\x000\x00X\x00V\x00W\x00\x03\x002\x00F\x00F\x00X\x00U\x00\x03\x00(\x00Y\x00H\x00U\x00\\\x00\x03\x00\x16\x00\x03\x000\x00R\x00Q\x00W\x00K\x00V'
but I'm getting garbage. Even without the null bytes:
>>> binascii.a2b_hex(s).replace(b'\x00', b'')
b'5HDVVHVVPHQW\x030XVW\x032FFXU\x03(YHU\\\x03\x16\x030RQWKV'
I expect it to look something like this (in reverse):
>>> binascii.b2a_hex(b'Reassessment Must Occur Every 3 Months')
b'52656173736573736d656e74204d757374204f636375722045766572792033204d6f6e746873'
I found this comment on this somewhat-related SO post:
Literal string (7.3.4.2) - this is pretty much straight-forward, as you just walk the data for "(.?)" * - That's only true for simple examples using standard font encoding. Meanwhile custom encodings for embedded fonts have become very common.
So... maybe that hex string isn't just hex-encoded ASCII?
What am I missing in trying to extract the textual difference?
Upvotes: 3
Views: 2118
Reputation: 46
No, it is not ASCII. ASCII is a 7-bit code stored one character per byte, whereas each character in this string is a 2-byte code (four hex digits).
Multibyte character codes are used with PDF composite (Type 0) fonts. In the common case they specify the glyph to be drawn by its index in the font's glyph table, so they are not character codes in any standard text encoding. A separate reverse mapping from these codes to Unicode (the font's ToUnicode CMap, when one is present) is what makes text search possible.
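As a rough illustration of that reverse mapping: in a PDF it normally lives in the font's ToUnicode stream, a CMap with bfchar and bfrange sections. The sketch below parses a hand-written fragment. The names SAMPLE_CMAP, parse_tounicode and decode_hex_string are hypothetical, and the fragment is made up for illustration, not extracted from the PDF in the question. It assumes 2-byte codes and handles only the simple `<lo> <hi> <start>` range form.

```python
import re

# Hypothetical ToUnicode fragment, written by hand for illustration only.
SAMPLE_CMAP = """
1 beginbfchar
<0003> <0020>
endbfchar
1 beginbfrange
<0004> <005F> <0021>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build {code: unicode string} from bfchar and simple bfrange sections."""
    mapping = {}
    # bfchar entries are pairs: <source code> <UTF-16BE destination>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange entries are triples: <first code> <last code> <first destination>
    # Assumption: each destination is a single BMP code point.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            first = int(dst, 16)
            for i, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(first + i)
    return mapping


def decode_hex_string(hex_string, mapping, code_bytes=2):
    """Split a PDF hex string into fixed-width codes and map each to text."""
    width = 2 * code_bytes
    codes = [int(hex_string[i:i + width], 16)
             for i in range(0, len(hex_string), width)]
    # Codes missing from the mapping become U+FFFD rather than raising.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_tounicode(SAMPLE_CMAP)
print(decode_hex_string("003500480044005600560048005600560050004800510057", mapping))
# prints: Reassessment
```

With a real file, the CMap text would have to come from the decompressed ToUnicode stream of the font named in the content stream (/C2_2 in the diff above); how to get at that stream depends on the tool used.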
The common OpenType font format conventionally uses glyph index 0 = .notdef, 1 = .null, 2 = CR and 3 = space (ASCII code 32). Note that 32 - 3 = 29.
So an OpenType composite font built for the ASCII character set, with the non-printing characters 0 to 31 omitted, will have the property:
Glyph index + 29 = ASCII
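One way to check this property against the data in the question is to compare each 2-byte code with the character it is expected to represent. This is only a sanity check under the assumption that the expected text is right; the variable names are illustrative, and the offset is a property of this particular font, not of PDF hex strings in general.

```python
# Hex string from the diff in the question, and the text it should render as.
hex_string = "0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056"
expected = "Reassessment Must Occur Every 3 Months"

# Split into 2-byte (four hex digit) glyph codes.
codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]

# Collect the distinct differences between ASCII value and glyph code.
offsets = {ord(ch) - code for ch, code in zip(expected, codes)}

print(len(codes) == len(expected), offsets)
# prints: True {29}
```

A single offset for every character confirms the relationship for this string. A subsetted or differently ordered font would generally not show a constant offset.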
Upvotes: 3
Reputation: 362687
Here we go:
>>> s = '0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056'
>>> def chunks(seq, n):
...     return [seq[i:i + n] for i in range(0, len(seq), n)]
...
>>> ns = [29 + int(c, 16) for c in chunks(s, 4)]
>>> print(bytes(ns))
b'Reassessment Must Occur Every 3 Months'
chunks, a helper that splits a sequence into fixed-size pieces, is copied from here.
Upvotes: 4