Beku Ch

Reputation: 79

How can I edit text in a PDF that is encoded in hexadecimal format?

I'm trying to find and replace certain text with a specific value in a PDF. I am using the Python library pdfrw, since Python is my preferred environment. The following is example content from the first page of the document.

BT\n/F8 40 Tf\n1 0 0 -1 569 376 Tm\n<0034> Tj\n26 0 Td <0028> Tj\n22 0 Td <0032> Tj\n25 0 Td <0031> Tj\n32 0 Td <0034> Tj\n26 0 Td <0036> Tj\nET\n0 .8863 1 RG

which corresponds to the word "REPORT" in the document. So far I understand the meaning of all the operators and numbers in this format, and I have successfully manipulated the position and some of the characters. But I do not understand in what format or encoding the individual characters are encoded (<0034>, <0028>, etc.).

I tried brute-forcing every combination of <00xx> but only found valid matches for the letters R, E, P, O, T, which are the letters used in the word. I tried the same for F11 and F10, which are also included in the page, and got the same result: only the letters that are actually used matched. If anyone can explain how this encoding works and how I can edit it so that I can insert any UTF-8 character, that would be very helpful.

Thank you.

Note 1: the following is the F8 object:

{'/Subtype': '/Type0', '/Type': '/Font', '/BaseFont': '/OpenSans-Bold', '/Encoding': '/Identity-H', '/DescendantFonts': [{'/DW': '0', '/Subtype': '/CIDFontType2', '/CIDSystemInfo': {'/Supplement': '0', '/Registry': '(Adobe)', '/Ordering': '(Identity)'}, '/Type': '/Font', '/FontDescriptor': {'/Descent': '-292.96875', '/CapHeight': '713.86719', '/StemV': '83.984375', '/Type': '/FontDescriptor', '/FontFile2': {'/Length1': '5540', '/Length': '5540'}, '/Flags': '4', '/FontName': '/OpenSans-Bold', '/ItalicAngle': '0', '/FontBBox': ['-619.14063', '-292.96875', '1318.84766', '1068.84766'], '/Ascent': '1068.84766'}, '/BaseFont': '/OpenSans-Bold', '/W': ['0', ['600.09766'], '40', ['560.05859'], '49', ['795.89844', '627.92969', '0', '660.15625', '0', '579.10156']], '/CIDToGIDMap': '/Identity'}], '/ToUnicode': {'/Length': '413'}}

Note 2: Also, replacing text in the (nice text)Tj\n or (<0032><0032>) fashion does not work here.

Upvotes: 2

Views: 3543

Answers (3)

Beku Ch

Reputation: 79

As the previous answers pointed out, the embedded font in the document was only a subset, and the encodings referenced characters that were unknown to me. I solved the issue by first creating a temporary PDF that contains every letter of the alphabet (and therefore the font information I need), and then replacing the font resource of the original file with that of my new file. After that I can manipulate the text in the same way as in my temporary file, like so:

target.pages[0].Resources.Font = font_pdf.pages[0].Resources.Font
# str.replace returns a new string, so assign the result back to the stream
target.pages[0].Contents.stream = target.pages[0].Contents.stream.replace(
    "BT\n/F8 40 Tf\n1 0 0 -1 569 376 Tm\n<0034> Tj\n26 0 Td <0028> Tj\nET",
    "BT\n/F0 11 Tf\n1 0 0 -1 500 500 Tm\n(\x02Y\x02Q) Tj\nET"
)

Thank you all :)

Note: I still don't have a good solution for decoding the hexadecimal codes using the document's own font, so I decided to use pattern matching, since I know what text to expect. Better solutions would be very welcome.
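One possible way to decode the hex codes with the document's own font is to read the font's /ToUnicode stream, which maps character codes to Unicode. The sketch below parses only the simple `beginbfchar` and `beginbfrange` forms from an already-decompressed CMap string. The sample CMap text is hypothetical (the real stream content is not shown in the question), and `parse_tounicode` and `decode_hex_string` are illustrative names, not pdfrw functions.

```python
import re

# Hypothetical ToUnicode CMap content, for illustration only.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0028> <0045>
<0034> <0052>
endbfchar
1 beginbfrange
<0031> <0032> <004F>
endbfrange
endcmap
"""

def parse_tounicode(cmap_text):
    """Return {character code: unicode string} for bfchar and simple bfrange entries."""
    mapping = {}
    # bfchar sections hold pairs of <code> <unicode as UTF-16BE hex>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code, uni in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            mapping[int(code, 16)] = bytes.fromhex(uni).decode("utf-16-be")
    # bfrange sections hold <start> <end> <first unicode>; the array form is not handled
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, uni in re.findall(
            r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section
        ):
            first = int(uni, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(first + offset)
    return mapping

def decode_hex_string(hex_string, mapping):
    """Decode e.g. '<00340028>' as 2-byte codes; unknown codes become '?'."""
    digits = hex_string.strip("<>")
    codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    return "".join(mapping.get(c, "?") for c in codes)

cmap = parse_tounicode(SAMPLE_CMAP)
print(decode_hex_string("<0034002800320031>", cmap))
```

With a real file, the CMap text would come from the font's /ToUnicode stream after decompression; codes missing from that map cannot be decoded this way.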

Upvotes: 1

Olivier

Reputation: 18132

'/Encoding': '/Identity-H' and '/CIDToGIDMap': '/Identity' mean that the character code corresponds to the glyph id. So <0034> shows the glyph number 0x34 from the selected font.
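As a minimal sketch of what this means in practice: with Identity-H each character code is two bytes, so a hex string splits into 4-digit groups that are the glyph IDs themselves. The function names below are illustrative, and the sketch rejects strings that are not a whole number of 2-byte codes.

```python
def hex_to_glyph_ids(hex_string):
    """Split a PDF hex string such as '<00340028>' into 2-byte glyph IDs."""
    # Whitespace is allowed inside <...>, so drop it before grouping digits.
    digits = "".join(hex_string.strip().strip("<>").split())
    if len(digits) % 4:
        raise ValueError("expected a whole number of 2-byte codes")
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]

def glyph_ids_to_hex(glyph_ids):
    """Build a hex string operand from a list of glyph IDs."""
    return "<" + "".join(f"{gid:04X}" for gid in glyph_ids) + ">"

print(hex_to_glyph_ids("<0034>"))       # [52]
print(glyph_ids_to_hex([0x34, 0x28]))   # <00340028>
```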

If the font has been subsetted, you have only access to the glyphs that have been included in the subset.

'/Length': '5540' means that the embedded font file is only 5540 bytes, which clearly means it is subsetted.
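One way to see which glyphs the subset probably contains is the /W array in the font dictionary from the question, which lists widths per CID. The sketch below handles only the `first_cid [w1 w2 ...]` form that appears there, and treating a zero width as "glyph not in the subset" is an assumption, not something the PDF format guarantees.

```python
# /W array copied from the F8 font dictionary in the question.
W = ['0', ['600.09766'], '40', ['560.05859'],
     '49', ['795.89844', '627.92969', '0', '660.15625', '0', '579.10156']]

def cids_with_width(w_array):
    """Return {cid: width} for entries of the form `first_cid [w1 w2 ...]`."""
    widths = {}
    i = 0
    while i < len(w_array):
        first = int(w_array[i])
        entry = w_array[i + 1]
        if not isinstance(entry, list):
            raise NotImplementedError("range form `first last width` is not handled here")
        for offset, w in enumerate(entry):
            if float(w) != 0:  # assumption: zero width means the glyph is absent
                widths[first + offset] = float(w)
        i += 2
    return widths

present = cids_with_width(W)
print([f"<{cid:04X}>" for cid in sorted(present)])
```

Apart from glyph 0, the nonzero entries are 40, 49, 50, 52 and 54, which line up with the codes <0028>, <0031>, <0032>, <0034> and <0036> used for E, O, P, R and T in the question's stream.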

Upvotes: 0

Arty

Reputation: 16747

In general, I think PDF text can be compressed or encoded by different algorithms, which is why pdfrw doesn't decode text by itself. So there is no single correct way in general, because it is different for each case. I've tried a simple PDF from here and it contains just plain text inside.

You probably couldn't figure out the correspondence between characters and hex codes because it may be a compressed stream, which would mean that each code depends on the position of the character in the whole stream plus the values of all previous characters. E.g. the text may be zlib-compressed.
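For reference, stream compression in a PDF (commonly /FlateDecode, which is zlib) applies to the stream as a whole: once decompressed, the operators and strings are plain text again. The readable BT ... ET text in the question suggests that stream is already in decompressed form. A minimal sketch with the standard library, using a shortened copy of the question's stream as sample data:

```python
import zlib

# Shortened copy of the content stream shown in the question.
content = b"BT\n/F8 40 Tf\n1 0 0 -1 569 376 Tm\n<0034> Tj\n26 0 Td <0028> Tj\nET\n"

compressed = zlib.compress(content)      # what a /FlateDecode stream would store
restored = zlib.decompress(compressed)   # what a reader sees after decoding

print(restored == content)
print(b"<0034> Tj" in restored)
```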

Also, PDF text is a sequence of commands for positioning, formatting and outputting text, so in general you have to be able to decode and encode all of these commands to process arbitrary text. Your format may contain a symbol table where all used symbols are mapped to hex values. To figure out the correct mapping, all symbols should be present in the example text.

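For your case you can probably use the following table for conversion. It relies on the fact that the letter R has hex value 0x34 and assumes a constant offset from ASCII. Note that this matches R, P, O and T in your sample but predicts 0x27 for E, whereas your stream uses <0028>, so the offset is not constant for this font and the table is only a rough guide: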


import sys
for i, n in enumerate(range(32, 128)):
    sys.stdout.write(f"{hex(n - ord('R') + 0x34).ljust(4)}: '{chr(n)}' ")
    if (i + 1) % 8 == 0:
        sys.stdout.write('\n')

Output:

0x2 : ' ' 0x3 : '!' 0x4 : '"' 0x5 : '#' 0x6 : '$' 0x7 : '%' 0x8 : '&' 0x9 : ''' 
0xa : '(' 0xb : ')' 0xc : '*' 0xd : '+' 0xe : ',' 0xf : '-' 0x10: '.' 0x11: '/' 
0x12: '0' 0x13: '1' 0x14: '2' 0x15: '3' 0x16: '4' 0x17: '5' 0x18: '6' 0x19: '7' 
0x1a: '8' 0x1b: '9' 0x1c: ':' 0x1d: ';' 0x1e: '<' 0x1f: '=' 0x20: '>' 0x21: '?' 
0x22: '@' 0x23: 'A' 0x24: 'B' 0x25: 'C' 0x26: 'D' 0x27: 'E' 0x28: 'F' 0x29: 'G' 
0x2a: 'H' 0x2b: 'I' 0x2c: 'J' 0x2d: 'K' 0x2e: 'L' 0x2f: 'M' 0x30: 'N' 0x31: 'O' 
0x32: 'P' 0x33: 'Q' 0x34: 'R' 0x35: 'S' 0x36: 'T' 0x37: 'U' 0x38: 'V' 0x39: 'W' 
0x3a: 'X' 0x3b: 'Y' 0x3c: 'Z' 0x3d: '[' 0x3e: '\' 0x3f: ']' 0x40: '^' 0x41: '_' 
0x42: '`' 0x43: 'a' 0x44: 'b' 0x45: 'c' 0x46: 'd' 0x47: 'e' 0x48: 'f' 0x49: 'g' 
0x4a: 'h' 0x4b: 'i' 0x4c: 'j' 0x4d: 'k' 0x4e: 'l' 0x4f: 'm' 0x50: 'n' 0x51: 'o' 
0x52: 'p' 0x53: 'q' 0x54: 'r' 0x55: 's' 0x56: 't' 0x57: 'u' 0x58: 'v' 0x59: 'w' 
0x5a: 'x' 0x5b: 'y' 0x5c: 'z' 0x5d: '{' 0x5e: '|' 0x5f: '}' 0x60: '~' 0x61: '' 

The code for converting one of your hex values to a character is simple:

hex_val = '0030'
print(chr(int(hex_val, 16) - 0x34 + ord('R')))

If you have a more complicated mapping between characters and hex values, then you just have to create a text with all possible characters, convert it using your converter, and see which hex value appears for each letter.
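This "known text" approach can be sketched by pairing the hex codes found in a content stream with the characters they are known to render. The sample is the stream from the question, which is known to show "REPORT"; `build_mapping` and `encode_text` are illustrative names, and the sketch assumes one 2-byte code per Tj operator, as in that stream.

```python
import re

# Content stream from the question, known to render the word "REPORT".
STREAM = ("BT\n/F8 40 Tf\n1 0 0 -1 569 376 Tm\n<0034> Tj\n26 0 Td <0028> Tj\n"
          "22 0 Td <0032> Tj\n25 0 Td <0031> Tj\n32 0 Td <0034> Tj\n"
          "26 0 Td <0036> Tj\nET\n")

def build_mapping(stream, known_text):
    """Pair each 2-byte code shown with Tj to the character it is known to render."""
    codes = re.findall(r"<([0-9A-Fa-f]{4})>\s*Tj", stream)
    if len(codes) != len(known_text):
        raise ValueError("code count does not match the known text length")
    return {ch: code.upper() for code, ch in zip(codes, known_text)}

def encode_text(text, char_to_code):
    """Encode text as one hex string; raises KeyError for characters never seen."""
    return "<" + "".join(char_to_code[ch] for ch in text) + ">"

char_to_code = build_mapping(STREAM, "REPORT")
print(char_to_code)
print(encode_text("PORT", char_to_code))
```

Only characters that occur in the known text can be encoded this way, which matches the subsetting limitation described in the other answers.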

I've also looked at how text is encoded inside a PDF and which commands are used, and it looks like the string in front of a Tj command contains the text itself. Based on that, I wrote the PDF text modifier below. It accepts a file name or URL as the first argument and an output file name as the second, or you can run it without arguments to use the default example. The replacements to apply are listed at the beginning of the script in the changes variable.

Note that this modifier doesn't decode your hex format. It is only handy for replacing text that is stored as plain text.


import sys, io
# Needs: python -m pip install pdfrw
from pdfrw import PdfReader, PdfWriter

# Replacements to apply: {old text: new text}
changes = {'And': 'Or', 'text': 'string'}

def ReplaceText(text, reps=None):
    reps = reps or {}  # avoid a mutable default argument
    res, in_block = '', False
    for line in text.splitlines():
        line = line.strip()
        nline = line
        if line == 'BT':
            in_block = True
        elif line == 'ET':
            in_block = False
        elif in_block:
            cmd = line.rpartition(' ')[2]
            if cmd.lower() == 'tj':
                for k, v in reps.items():
                    nline = nline.replace(k, v)
        res += nline + '\n'
    return res

ifn = sys.argv[1] if len(sys.argv) > 1 else 'http://www.africau.edu/images/default/sample.pdf'
ofn = (ifn[:ifn.rfind('.')] + '.processed.pdf') if len(sys.argv) <= 2 else sys.argv[2]

if ifn.lower().startswith('http'):
    # Needs: python -m pip install requests
    import requests
    ofn = (ifn[ifn.rfind('/') + 1:] + '.processed.pdf') if len(sys.argv) <= 2 else sys.argv[2]
    ifn = io.BytesIO(requests.get(ifn).content)
    
r = PdfReader(ifn)
for page in r.pages:
    page.Contents.stream = ReplaceText(page.Contents.stream, changes)

PdfWriter(ofn, trailer = r).write()

Upvotes: 0
