Reputation: 25
I'm using Apache PDFBox 2.0.2 to load PDF documents from the web and extract their text.
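URL u = new URL("url/to/file.pdf");
String doc;
// try-with-resources closes the stream and the document even if extraction fails
try (InputStream in = u.openStream(); PDDocument pddDocument = PDDocument.load(in)) {
    PDFTextStripper textStripper = new PDFTextStripper();
    doc = textStripper.getText(pddDocument);
}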
The problem is that I sometimes get an IllegalArgumentException: "Symbolic fonts must have a built-in encoding" and can't extract the text from the PDF.
Please help.
Upvotes: 1
Views: 557
Reputation: 95898
As @Tilman already indicated by opening an issue in the PDFBox Jira, this behavior is a bug:
The DictionaryEncoding constructor retrieves an Encoding instance for the base encoding of a font using Encoding.getInstance and is well aware that this method may return null:
base = Encoding.getInstance(name); // may be null
If it is null, though, and PDFBox has not been able to determine a built-in encoding of the font, the observed exception is thrown:
throw new IllegalArgumentException("Symbolic fonts must have a built-in " +
"encoding");
In the case at hand, the base encoding is MacExpertEncoding, which is one of the possible base encodings explicitly named by the PDF specification. Unfortunately, Encoding.getInstance does not know this encoding and therefore returns null, which in turn triggers the exception because PDFBox also could not identify a built-in encoding.
Thus, a fix should include adding an Encoding class for MacExpertEncoding and extending Encoding.getInstance accordingly.
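To make the failure and the proposed fix concrete, here is a minimal stand-alone sketch. It is not PDFBox source code: the class BaseEncodingLookupSketch, its methods, and the string placeholders standing in for Encoding instances are all hypothetical, and the set of "known" names is only illustrative. It models a lookup that returns null for an unknown base encoding name and shows how registering MacExpertEncoding avoids the exception for a symbolic font without a built-in encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the lookup described above; NOT PDFBox code.
class BaseEncodingLookupSketch {
    // Maps base encoding names to placeholder strings standing in for Encoding instances.
    private final Map<String, String> known = new HashMap<>();

    // includeMacExpert = true models the proposed fix (illustrative assumption).
    BaseEncodingLookupSketch(boolean includeMacExpert) {
        known.put("StandardEncoding", "standard");
        known.put("WinAnsiEncoding", "win-ansi");
        known.put("MacRomanEncoding", "mac-roman");
        if (includeMacExpert) {
            known.put("MacExpertEncoding", "mac-expert");
        }
    }

    // Like the lookup in the answer, this may return null for unknown names.
    String getInstance(String name) {
        return known.get(name);
    }

    // Models the symbolic-font case: fall back to the built-in encoding, else throw.
    String resolveForSymbolicFont(String baseName, String builtIn) {
        String base = getInstance(baseName);
        if (base == null) {
            if (builtIn == null) {
                throw new IllegalArgumentException(
                        "Symbolic fonts must have a built-in encoding");
            }
            base = builtIn;
        }
        return base;
    }

    public static void main(String[] args) {
        BaseEncodingLookupSketch unfixed = new BaseEncodingLookupSketch(false);
        try {
            unfixed.resolveForSymbolicFont("MacExpertEncoding", null);
        } catch (IllegalArgumentException e) {
            System.out.println("Without the fix: " + e.getMessage());
        }
        BaseEncodingLookupSketch fixed = new BaseEncodingLookupSketch(true);
        System.out.println("With the fix: "
                + fixed.resolveForSymbolicFont("MacExpertEncoding", null));
    }
}
```

In the sketch the only difference between the two runs is whether the lookup knows the name, which is the core of the fix suggested above.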
Furthermore, one should consider not throwing the exception at all: there are numerous situations in which there is no need for an implicit or explicit base encoding, e.g. if the Differences explicitly provide a mapping for each character code or (in the case of pure text extraction) if the font has a good ToUnicode table.
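As a rough illustration of that more lenient check, the following stand-alone sketch decides whether a missing base encoding could be tolerated. Everything in it is an assumption made for illustration: the class LenientEncodingCheckSketch, the method canSkipBaseEncoding, and the reading of "each character code" as "each code that actually occurs in the text" are not taken from PDFBox.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a lenient check; NOT PDFBox code.
class LenientEncodingCheckSketch {
    // Returns true if text extraction could proceed without any base encoding:
    // either a usable ToUnicode table exists, or the Differences array maps
    // every character code that is actually used (assumed interpretation).
    static boolean canSkipBaseEncoding(Set<Integer> usedCodes,
                                       Map<Integer, String> differences,
                                       boolean hasGoodToUnicode) {
        if (hasGoodToUnicode) {
            return true;
        }
        return differences.keySet().containsAll(usedCodes);
    }

    public static void main(String[] args) {
        Set<Integer> usedCodes = new HashSet<>(Arrays.asList(65, 66));
        Map<Integer, String> differences = new HashMap<>();
        differences.put(65, "Asmall");
        differences.put(66, "Bsmall");

        // Differences cover all used codes, so no base encoding is needed.
        System.out.println(canSkipBaseEncoding(usedCodes, differences, false));

        // A code without a Differences entry and no ToUnicode table: not safe.
        usedCodes.add(67);
        System.out.println(canSkipBaseEncoding(usedCodes, differences, false));
    }
}
```

Only when such a check fails would throwing an exception, or falling back to some default encoding, still be necessary.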
Upvotes: 1