Reputation: 4383
I am not sure whether this is a PDFBox issue, but mentioning it might help explain my problem.
I have been getting a lot of warnings like this one from PDFBox:
WARN No Unicode mapping for a37 (37) in font TCBLZV+LCIRCLE10
This is one of hundreds.
So I decided to install the LCIRCLE10 font and the other fonts mentioned in the warnings.
Here are the fonts I downloaded:
Here is the PDFBox error I am getting:
ERROR Could not load font file: /home/$USER/.fonts/bakoma/pfb/eurb9.pfb
java.io.IOException: Found Token[kind=NAME, text=dup] but expected INTEGER
    at org.apache.fontbox.type1.Type1Parser.read(Type1Parser.java:812)
    at org.apache.fontbox.type1.Type1Parser.readEncoding(Type1Parser.java:226)
    at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:135)
    at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
    at org.apache.fontbox.type1.Type1Font.createWithPFB(Type1Font.java:56)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.addType1Font(FileSystemFontProvider.java:646)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.scanFonts(FileSystemFontProvider.java:255)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.<init>(FileSystemFontProvider.java:225)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl$DefaultFontProvider.<clinit>(FontMapperImpl.java:130)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getProvider(FontMapperImpl.java:149)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:413)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:376)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:350)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:146)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<clinit>(PDType1Font.java:79)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
It is one of many such errors.
Here is a short list:
ERROR Could not load font file: /home/$USER/.fonts/bakoma/pfb/eufm6.pfb
ERROR Could not load font file: /home/$USER/.fonts/bakoma/pfb/euex9.pfb
ERROR Could not load font file: /home/$USER/.fonts/bakoma/pfb/eusm10.pfb
ERROR Could not load font file: /home/$USER/.fonts/bakoma/pfb/cmmi7.pfb
ERROR Could not load font file: /home/$USER/.fonts/bakoma/pfb/msam6.pfb
They all seem to come from: .fonts/bakoma/pfb/
When I opened Firefox I saw this:
I removed the fonts from ~/.fonts/ and cleared the font cache, and now everything is back to normal.
Upvotes: 0
Views: 3504
Reputation: 18851
The "WARN No Unicode mapping" messages are only relevant if you do text extraction: the extracted text will contain nothing for that glyph because the Unicode mapping is missing. "TCBLZV+LCIRCLE10" indicates an embedded font subset, so installing fonts won't help anyway. See also this: https://pdfbox.apache.org/2.0/faq.html#notext
So your real question ends there: loading fonts doesn't improve anything unless you have trouble with non-embedded fonts.
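To make the subset point concrete: in PDF files, a subset font's name is conventionally prefixed with a tag of six uppercase letters followed by a plus sign. Here is a minimal sketch that checks a font name for that tag. The class and method names (SubsetFontName, isSubset, baseName) are made up for illustration and are not part of PDFBox.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
final class SubsetFontName {
    // Subset tag convention: exactly six uppercase letters, then '+', then the base font name.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+.+");

    // Returns true if the name carries a subset tag such as "TCBLZV+".
    static boolean isSubset(String fontName) {
        return fontName != null && SUBSET_TAG.matcher(fontName).matches();
    }

    // Strips the seven-character tag ("XXXXXX+") if present, otherwise returns the name unchanged.
    static String baseName(String fontName) {
        return isSubset(fontName) ? fontName.substring(7) : fontName;
    }
}
```

For the name from the warning, isSubset("TCBLZV+LCIRCLE10") is true and baseName gives "LCIRCLE10". A name that matches is a subset that lives inside the PDF, so PDFBox will not look for it in ~/.fonts.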
The error "Found Token[kind=NAME, text=dup] but expected INTEGER" indicates an error parsing a Type 1 font. This can be a syntax error in the font, or a bug in the PDFBox Type 1 font parser. I rather suspect the latter, because Type 1 fonts are based on PostScript and the PDFBox parser can recognize only a subset of it.
Update: I looked at the eurb9.pfb font. It is as I suspected: the ASCII part of the font contains a calculation ("dup dup 161 10 getinterval 0 exch putinterval dup dup 173 23 getinterval 10 exch putinterval dup dup 127 exch 196 get put readonly def") and we can't parse it. Our own Type 1 parser can only parse elements that don't calculate. (This still covers 99% of Type 1 fonts.)
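If you want to find out which of your installed fonts have this kind of computed encoding, a rough heuristic is to scan the ASCII part of the font for the getinterval / putinterval operators inside the /Encoding definition. The sketch below does only that. It is an assumption-laden illustration, not how the PDFBox parser works: the class name Type1EncodingCheck is invented, the two operators are just the ones seen in eurb9.pfb, and the caller is assumed to have already extracted the ASCII section as a string.

```java
// Hypothetical helper, not a PDFBox class.
final class Type1EncodingCheck {
    // Heuristic: returns true if getinterval or putinterval appears between
    // the "/Encoding" token and the first "def" token that follows it.
    static boolean hasComputedEncoding(String asciiSection) {
        // Split on whitespace and on PostScript braces/brackets so that
        // "/.notdef" stays one token and is not mistaken for "def".
        String[] tokens = asciiSection.trim().split("[\\s{}\\[\\]]+");
        boolean inEncoding = false;
        for (String token : tokens) {
            if (!inEncoding) {
                inEncoding = token.equals("/Encoding");
            } else if (token.equals("def")) {
                return false; // encoding definition ended without a calculation
            } else if (token.equals("getinterval") || token.equals("putinterval")) {
                return true;  // array copying found inside the encoding
            }
        }
        return false;
    }
}
```

A font that only uses "dup <index> /<name> put" entries, or "/Encoding StandardEncoding def", comes back false. A font like eurb9.pfb comes back true and can be kept out of the scanned font directories.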
Upvotes: 1