HrkBrkkl

Reputation: 673

How is the text from this pdf encoded?

I have some PDFs with data about machine parts and I am trying to extract sizes. I extracted the text from a PDF with pypdfium2.

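import pypdfium2 as pdfium
pdf = pdfium.PdfDocument("myfile.pdf")
page = pdf[1]  # pages are 0-indexed, so this is the second page
textpage = page.get_textpage()
text = textpage.get_text_range()  # extract the page text as a string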

Most of the text is readable, but for some reason the important data is not readable when extracted. In the extracted string the relevant part looks like this:

Readable text \r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15 readable text

I also tried tika and PyMuPDF. They only give me the question mark character for those parts.

I know the mangled part (\r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15) should be 3,0 8,8 +0,058/0 5,0 4,0 4,5. My current idea is to make my own encoding table, but I wanted to ask if there is a better method and if this looks familiar to someone. I have about 52 files with around 200 occurrences each. While the PDFs are not confidential, I don't want to post links because they are not my intellectual property.

Update------------------------------

I tried to find out more about the fonts.

from pdfreader import PDFDocument
fd = open("myfile", "rb")
doc = PDFDocument(fd)
page = next(doc.pages())
font_keys=sorted(page.Resources.Font.keys())

for font_key in font_keys:
    font = page.Resources.Font[font_key]
    print(f"{font_key}: {font.Subtype}, {font.BaseFont}, {font.Encoding}")

gives:

R13: Type0, UHIIUQ+MetaPlusBold-Roman-Identity-H, Identity-H
R17: Type0, EWGLNL+MetaPlusBold-Caps-Identity-H, Identity-H
R20: Type1, NRVKIY+Meta-LightLF, {'Type': 'Encoding', 'BaseEncoding': 'WinAnsiEncoding', 'Differences': [33, 'agrave', 'degree', 39, 'quoteright', 177, 'endash']}
R24: Type0, IKRCND+MetaPlusBold-Italic-Identity-H, Identity-H

-Edit------ I am not interested in help translating it manually. I can do that myself. I am interested in a solution that works by script, for example a script that extracts the fonts with their code maps from the PDF and then uses those to translate the unreadable parts.

Upvotes: 2

Views: 1202

Answers (2)

Jorj McKie

Reputation: 3110

Here is example code to get the source of a font's CMAP with PyMuPDF:

import fitz
doc = fitz.open("some.pdf")
# assume that we know a font's xref already
# extract the xref of its CMAP:
cmap_xref = doc.xref_get_key(xref, "ToUnicode")[1]  # second string is 'nnn 0 R'
if cmap_xref.endswith("0 R"):  # check if a CMAP exists at all
    cxref = int(cmap_xref.split()[0])
else:
    raise ValueError("no CMAP found")
print(doc.xref_stream(cxref).decode())  # convert bytes to string

Example output:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R63 def
1 begincodespacerange
<00><ff>
endcodespacerange
12 beginbfrange
<20><20><0020>
<2e><2e><002e>
<30><31><0030>
<43><46><0043>
<49><49><0049>
<4c><4d><004c>
<4f><50><004f>
<61><61><0061>
<63><69><0063>
<6b><70><006b>
<72><76><0072>
<78><79><0078>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
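
To go from CMAP source like the example output above to something a script can use, the bfchar and bfrange sections can be parsed into a lookup table. The following is only a minimal sketch: `parse_tounicode` is a hypothetical helper name, the `sample` string is a shortened copy of the example output, and the parser assumes each bfrange destination is a single 16-bit code unit. The array form of bfrange (`<lo><hi>[<a><b>...]`) is not handled.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def parse_tounicode(cmap_text):
    """Build {source code: unicode string} from bfchar/bfrange sections.

    Simplified sketch: ignores the array form of bfrange and assumes
    every bfrange destination is one 16-bit code unit.
    """
    mapping = {}
    # bfchar entries have the shape: <src> <dst>
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange entries have the shape: <lo> <hi> <dst of lo>
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping

# Shortened copy of the example CMAP output, used only for illustration.
sample = """1 begincodespacerange
<00><ff>
endcodespacerange
3 beginbfrange
<20><20><0020>
<30><31><0030>
<63><69><0063>
endbfrange"""

table = parse_tounicode(sample)
print(table)
```

The resulting dictionary maps each character code used in the content stream to the text it stands for, so the same table can be applied to every occurrence that uses that font.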

Upvotes: 1

K J

Reputation: 11730

This is a not uncommon CID CMAP substitution, shown here in Python escape notation. It is usually specific to a single font whose name has a random six-letter prefix (e.g. UHIIUQ+FontName),
which is often found for subset fonts that contain only a limited range of characters.

should be 3,0 8,8 +0,058/0 5,0 4,0 4,5

\r\n = CR LF (Windows line ending, \x0d\x0a)
\x13 has been mapped to 3
\x0c has been mapped to ,
\x10 has been mapped to 0
 (literal nbsp)
\x18 = 8
\x0c = ,
\x18 = 8
 (literal nbsp)
\x0b has been mapped to +
\x10 = 0
\x0e has been mapped to , (very odd see \x0c)
\x10 = 0
\x15 = 5
\x18 = 8
\x0f has been mapped to /
\x10 = 0
 (literal nbsp)
\x15 etc......................
\x0c
\x10
 
\x14
\x0c
\x10
 
\x14
\x0c
\x15
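
The pairs worked out in the breakdown above can be collected into a translation table and applied with `str.translate`. This is only a sketch of the hand-built table idea from the question: the names `KNOWN`, `TABLE` and `decode` are made up for illustration, the table covers only the codes seen in this one sample, and `\x0e` is mapped to a comma because that is what the asker's expected text contains.

```python
# Code -> character pairs taken from the breakdown above.
# Only the codes that occur in the sample string are covered.
KNOWN = {
    "\x0b": "+",
    "\x0c": ",",
    "\x0e": ",",  # as in the asker's expected text; may really be "."
    "\x0f": "/",
    "\x10": "0",
    "\x13": "3",
    "\x14": "4",
    "\x15": "5",
    "\x18": "8",
}
TABLE = str.maketrans(KNOWN)

def decode(text):
    """Replace the known control codes; leave everything else untouched."""
    return text.translate(TABLE)

mangled = ("\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 "
           "\x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15")
decoded = decode(mangled)
print(decoded)
```

A table like this has to be extended whenever a new code shows up, which is why reading the font's CMAP is the more robust route.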

So the \x0# codes (low-order control codes) stand for punctuation
and the \x1# codes stand for digits.

It is unknown whether \x2# codes are used for letters; the CMAP table should be queried for the full details.

\x0e has been mapped to , (very odd, see \x0c).
Since it is a different code, I suspect it should actually be a dot used as the decimal separator.
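
One way to test that suspicion is to look at the numeric distance between each code and the character it was matched to. In this sample every pair listed above differs by 0x20, and applying the same offset to `\x0e` gives a dot. The sketch below only checks that arithmetic; `observed`, `OFFSET`, `consistent` and `candidate` are illustrative names, and the constant offset is an observation about this one string, not something guaranteed for other fonts or files. The font's CMAP remains the authoritative source.

```python
# Pairs from the breakdown above, leaving out the doubtful \x0e.
observed = {
    0x0b: "+", 0x0c: ",", 0x0f: "/", 0x10: "0",
    0x13: "3", 0x14: "4", 0x15: "5", 0x18: "8",
}

OFFSET = 0x20  # assumption derived from this sample only

# Does code + OFFSET reproduce the matched character for every pair?
consistent = all(chr(code + OFFSET) == char for code, char in observed.items())

# What would the doubtful code be under the same offset?
candidate = chr(0x0e + OFFSET)

print(consistent, repr(candidate))
```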

Upvotes: 2
