Reputation: 1
I am having trouble processing PDF files generated by Ghostscript. When I try to extract text from these PDFs using pdfminer and fitz, I get a RuntimeError with the message 'pdf device does not support type 3 fonts.' This error has seriously disrupted my workflow.
Has anyone else run into this, or something similar? If so, I would appreciate a detailed explanation of how you resolved it, or of any workaround that proved effective.
I am using the pdfminer.six package, version 20221105.
Upvotes: 0
Views: 422
Reputation: 11867
Text set in Type 3 fonts can be extracted in some cases, such as here (7 lines in Type 3 fonts and 1 in a Type 1 font), but not easily, since Type 3 fonts often use a custom encoding. Notice how the extraction on the left would need recoding to the numeric styles in the body (just like Caesar's encryption in Roman times, but clearly NOT the Times Roman font :-). The Type 1 font, however, is CMR10, which is Computer Modern Roman. Restructured output is "doable", but the recoding is specific to that document.
This is a very simple recoded format in a simple usage. Most cases are more complex, and the glyphs may be nothing more than bitmaps or outline shapes, i.e. not plain text.
BT
/F51 11.9552 Tf
1 0 0 1 91.925 710.037 Tm
[(012345678)-866.00009(@ABCDEFGH)-867.00009(PQRSTUVWX)-867.00009(`abcdefgh)] TJ
/F52 11.9552 Tf
1 0 0 1 91.925 695.093 Tm
[(012345678)-933(@ABCDEFGH)-933(PQRSTUVWX)-934.00009(`abcdefgh)] TJ
/F53 11.9552 Tf
1 0 0 1 91.925 680.14907 Tm
[(012345678)-1000.0001(@ABCDEFGH)-1000.0001(PQRSTU)-1(VWX)-1000.0001(`abcdefgh)] TJ
/F54 11.9552 Tf
1 0 0 1 91.925 665.2051 Tm
[(012345678)-1067(@ABCDEFGH)-1067(PQRSTUVWX)-1066(`abcdefgh)] TJ
/F55 11.9552 Tf
1 0 0 1 91.925 650.2611 Tm
[(012345678)-1133.0001(@ABCD)-1(EFGH)-1133.0001(PQRSTUVWX)-1133.0001(`abcdefgh)] TJ
/F56 11.9552 Tf
1 0 0 1 91.925 635.3181 Tm
[(012345678)-1200(@ABCDEFGH)-1200(PQRSTUVWX)-1200(`abcdefgh)] TJ
/F57 11.9552 Tf
1 0 0 1 91.925 620.37417 Tm
[(012345678)-1267(@ABCDEFGH)-1266(PQRSTUVWX)-1267(`abcdefgh)] TJ
/F1 9.9626 Tf
1 0 0 1 303.509 55.29016 Tm
[(1)] TJ
ET
So, for example, HTML extraction will need a lot of parsing to make the output resemble the source. A "find and replace" pass during conversion will work easily in this case, by providing alternate coding and font substitutions for parts of lines.
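As a rough illustration of that find-and-replace idea, here is a minimal Python sketch that pulls the strings out of the TJ arrays in the stream above and recodes them per font. It uses only the standard library and assumes one operator per line, as in this stream. The names `show_strings`, `recode` and `RECODE` are my own, and the translation table is invented purely for demonstration: the real mapping has to be worked out from the glyphs of each Type 3 font in your document.

```python
import re

# Two text-showing runs copied from the content stream above.
CONTENT = r"""
BT
/F51 11.9552 Tf
1 0 0 1 91.925 710.037 Tm
[(012345678)-866.00009(@ABCDEFGH)-867.00009(PQRSTUVWX)-867.00009(`abcdefgh)] TJ
/F1 9.9626 Tf
1 0 0 1 303.509 55.29016 Tm
[(1)] TJ
ET
"""

FONT_RE = re.compile(r"/(\w+)\s+[\d.]+\s+Tf")  # font selection, e.g. /F51 11.9552 Tf
TJ_RE = re.compile(r"\[(.*)\]\s*TJ")           # a TJ array written on one line
PART_RE = re.compile(r"\(((?:\\.|[^\\)])*)\)|(-?\d+(?:\.\d+)?)")  # (string) or number


def show_strings(stream, gap=-500.0):
    """Yield (font_name, text) for each TJ array in the stream.

    A kerning value at or below `gap` is treated as a word break.
    Escape sequences inside strings are left as they are.
    """
    font = None
    for line in stream.splitlines():
        m = FONT_RE.search(line)
        if m:
            font = m.group(1)
            continue
        m = TJ_RE.search(line)
        if not m:
            continue
        pieces = []
        for text, number in PART_RE.findall(m.group(1)):
            if number:
                if float(number) <= gap:
                    pieces.append(" ")
            else:
                pieces.append(text)
        yield font, "".join(pieces)


# HYPOTHETICAL per-font table: the real one is specific to each document.
RECODE = {"F51": str.maketrans("012345678", "abcdefghi")}


def recode(font, text):
    """Apply the table for this font; fonts without a table pass through."""
    return text.translate(RECODE.get(font, {}))


for font, raw in show_strings(CONTENT):
    print(font, repr(raw), "->", repr(recode(font, raw)))
```

The Type 1 font (F1, CMR10) has no table and passes through unchanged, which matches the point that only the Type 3 lines need recoding.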
Upvotes: 0