Hub

Reputation: 25

PDFBox "Symbolic fonts must have a built-in encoding" error when using PDFTextStripper.getText()

I'm using Apache PDFBox 2.0.2 to load PDF documents from the web and extract their text.

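URL u = new URL("url/to/file.pdf");
String doc;
// try-with-resources closes both the stream and the document, even on failure
try (InputStream in = u.openStream(); PDDocument pddDocument = PDDocument.load(in)) {
    PDFTextStripper textStripper = new PDFTextStripper();
    doc = textStripper.getText(pddDocument);
}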

The problem is that for some documents I get an IllegalArgumentException: "Symbolic fonts must have a built-in encoding", and I can't extract any text from the PDF.

Please help.

Upvotes: 1

Views: 557

Answers (1)

mkl

Reputation: 95898

As @Tilman already indicated by opening an issue in the PDFBox Jira, this behavior is a bug:

The DictionaryEncoding constructor retrieves an Encoding instance for the base encoding of a font using Encoding.getInstance and is well aware that this method may return null:

base = Encoding.getInstance(name); // may be null

If it is null, though, and PDFBox has not been able to determine a built-in encoding of the font, the observed exception is thrown:

throw new IllegalArgumentException("Symbolic fonts must have a built-in " + 
                                   "encoding");

In the case at hand, the base encoding is MacExpertEncoding, which is one of the possible base encodings explicitly named by the PDF specification. Unfortunately, Encoding.getInstance does not know this encoding and therefore returns null, which in turn triggers the exception because PDFBox also could not identify a built-in encoding.


Thus, a fix should include the addition of an Encoding class for MacExpertEncoding and extending Encoding.getInstance accordingly.
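To make the failure path and the proposed fix concrete, here is a minimal, self-contained sketch. It does not use PDFBox's real classes: BaseEncodingLookupSketch, its string placeholders, and the set of registered names are hypothetical stand-ins that only model the logic described above (a lookup that returns null for an unknown name, a fallback to a built-in encoding, and the exception when both are missing).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the lookup described above; not PDFBox code.
public class BaseEncodingLookupSketch {
    // Stand-in for the encodings the lookup knows about (names are illustrative).
    private final Map<String, String> known = new HashMap<>();

    // withMacExpert = true models the proposed fix of registering MacExpertEncoding.
    public BaseEncodingLookupSketch(boolean withMacExpert) {
        known.put("StandardEncoding", "standard");
        known.put("WinAnsiEncoding", "win-ansi");
        known.put("MacRomanEncoding", "mac-roman");
        if (withMacExpert) {
            known.put("MacExpertEncoding", "mac-expert");
        }
    }

    // Models a lookup that returns null for names it does not know.
    public String getInstance(String name) {
        return known.get(name);
    }

    // Models the constructor logic: use the base encoding if found,
    // otherwise the built-in encoding, otherwise fail.
    public String resolve(String baseName, String builtIn) {
        String base = getInstance(baseName); // may be null
        if (base != null) {
            return base;
        }
        if (builtIn != null) {
            return builtIn;
        }
        throw new IllegalArgumentException(
                "Symbolic fonts must have a built-in encoding");
    }
}
```

With new BaseEncodingLookupSketch(false), resolving "MacExpertEncoding" without a built-in encoding throws; with new BaseEncodingLookupSketch(true), the same call returns the registered placeholder.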

Furthermore, one should consider not throwing the exception at all: there are numerous situations in which there is no need for an implicit or explicit base encoding, e.g. if the Differences explicitly provide a mapping for each character code or (in the case of pure text extraction) if the font has a good ToUnicode table.
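The two situations just named could be expressed as a simple check that runs before any exception is thrown. The following is only a sketch of that idea under stated assumptions: LenientBaseEncodingSketch, canSkipBaseEncoding and all of its parameters are hypothetical names, and how PDFBox would actually obtain these inputs is not covered here.

```java
import java.util.Set;

// Hypothetical decision helper; not part of PDFBox.
public class LenientBaseEncodingSketch {
    /**
     * Returns true if a missing base encoding can be tolerated.
     *
     * @param differenceCodes    character codes mapped by the Differences array
     * @param usedCodes          character codes that actually need a mapping
     * @param textExtractionOnly whether only text extraction is required
     * @param hasGoodToUnicode   whether the font has a usable ToUnicode table
     */
    public static boolean canSkipBaseEncoding(Set<Integer> differenceCodes,
                                              Set<Integer> usedCodes,
                                              boolean textExtractionOnly,
                                              boolean hasGoodToUnicode) {
        // Case 1: Differences provide a mapping for every needed code.
        if (differenceCodes.containsAll(usedCodes)) {
            return true;
        }
        // Case 2: pure text extraction with a good ToUnicode table.
        return textExtractionOnly && hasGoodToUnicode;
    }
}
```

Only when this check returns false would the base encoding really be required, so the exception (or a logged warning plus a fallback) could be limited to that case.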

Upvotes: 1
