Reputation: 13
I am using the PDF Clown library to highlight some text inside a PDF file, but for some reason I get a NullPointerException when I run TextHighlightSample.
[java] java.lang.NullPointerException
[java] at java.util.Hashtable.hash(Hashtable.java:239)
[java] at java.util.Hashtable.put(Hashtable.java:519)
[java] at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:139)
[java] at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
[java] at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
[java] at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
[java] at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
[java] at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
[java] at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
[java] at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
[java] at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
[java] at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
[java] at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
[java] at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
[java] at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
[java] at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
[java] at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
[java] at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
[java] at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
[java] at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
[java] at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
[java] at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
[java] at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
[java] at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
[java] at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
[java] at org.pdfclown.samples.cli.TextHighlightSample.run(TextHighlightSample.java:56)
[java] at org.pdfclown.samples.cli.SampleLoader.run(SampleLoader.java:140)
[java] at org.pdfclown.samples.cli.SampleLoader.main(SampleLoader.java:56)
Does anyone know how to solve this problem?
Upvotes: 1
Views: 862
Reputation: 96029
The immediate issue is that PdfClown, in SimpleFont.onLoad() (while reading the Widths from the font dictionary into its own structures), assumes that glyphIndexes has an entry for every value that codes returns for the FirstChar-based indices of the Widths array:
    if(glyphWidthObjects != null)
    {
      ByteArray charCode = new ByteArray(
        new byte[]
        {(byte)((PdfInteger)getBaseDataObject().get(PdfName.FirstChar)).getIntValue()}
        );
      for(PdfDirectObject glyphWidthObject : glyphWidthObjects)
      {
        int glyphWidth = ((PdfNumber<?>)glyphWidthObject).getIntValue();
        if(glyphWidth > 0)
        {
          Integer code = codes.get(charCode);
          if(code != null)
          {
            glyphWidths.put(
              glyphIndexes.get(code), //<<<<<<<<<<<<<<<<<<<<<<
              glyphWidth
              );
          }
        }
        charCode.data[0]++;
      }
    }
If you check for null here, e.g. by replacing
    if(code != null)
by
    if(code != null && glyphIndexes.get(code) != null)
you will get rid of the NullPointerException.
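To make the guard concrete, here is a minimal, self-contained sketch of the same loop with the extra null check. It is not PdfClown code: the class NullKeyGuardSketch, the method collectWidths and the use of plain Integer keys (instead of PdfClown's ByteArray) are assumptions made for illustration only. It relies on the fact that java.util.Hashtable rejects null keys, which is what the top of the stack trace shows.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

// Hypothetical stand-in for the Widths loop in SimpleFont.onLoad(); not PdfClown code.
class NullKeyGuardSketch {
    static Map<Integer, Integer> collectWidths(
            Map<Integer, Integer> codes,        // character code -> Unicode (assumed Integer keys)
            Map<Integer, Integer> glyphIndexes, // lookup key -> glyph index
            int firstChar,
            int[] widths) {
        // Hashtable throws NullPointerException on a null key, like in the stack trace.
        Map<Integer, Integer> glyphWidths = new Hashtable<>();
        int charCode = firstChar;
        for (int width : widths) {
            if (width > 0) {
                Integer code = codes.get(charCode);
                // The guard: skip entries for which no glyph index is known.
                Integer glyphIndex = (code != null) ? glyphIndexes.get(code) : null;
                if (glyphIndex != null) {
                    glyphWidths.put(glyphIndex, width);
                }
            }
            charCode++;
        }
        return glyphWidths;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> codes = new HashMap<>();
        codes.put(65, 65);
        codes.put(66, 0x2019); // maps to a value glyphIndexes does not know
        Map<Integer, Integer> glyphIndexes = new HashMap<>();
        glyphIndexes.put(65, 3);
        // The second width is skipped instead of causing a NullPointerException.
        System.out.println(collectWidths(codes, glyphIndexes, 65, new int[] {500, 600}));
    }
}
```

Note that the guard only suppresses the exception; the widths of the skipped glyphs are simply not recorded.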
Usually there are glyphIndexes entries for all those values, so usually you don't get the NullPointerException here. But PdfClown, in its attempt to extract as much as possible, uses a mixture of information from the PDF objects and the embedded font objects, and there still seem to be some shortcomings in the coordination of that information, e.g. in the case of your document:
While constructing a TrueTypeFont object for the font SourceSansPro-Regular, PdfClown

1. (in Font.load) tries to read a ToUnicode map to get a mapping from character codes to Unicode and put it into codes; unfortunately the font has no ToUnicode map; thus, codes remains null;
2. (in the OpenFontParser construction in TrueTypeFont.loadEncoding, initially called by SimpleFont.onLoad) tries to read information from the embedded font file; among other data it retrieves a mapping 32..213 -> 0..44, mapping character codes to in-font glyph indices;
3. (in TrueTypeFont.loadEncoding, initially called by SimpleFont.onLoad) sets the font object's glyphIndexes member to that map; if there were a codes mapping already, it would be used here to change the mapping to a mapping Unicode -> 0..44; but codes is null (see above), so glyphIndexes remains as is;
4. (in TrueTypeFont.loadEncoding, initially called by SimpleFont.onLoad) as there is no codes mapping yet, creates one based on the MacRomanEncoding entry from the PDF font dictionary;
5. (in TrueTypeFont.loadEncoding, initially called by SimpleFont.onLoad) if there were no glyphIndexes yet, would derive one from the current codes mapping and the Widths array; but we already have one, so it remains as is;
6. (in SimpleFont.onLoad) finally tries to put the contents of the PDF font dictionary's Widths array into its glyphWidths map. The code (see above) assumes that glyphIndexes is a mapping from Unicode codes and, therefore, translates the character codes using codes first. Unfortunately glyphIndexes here is not a mapping from Unicode codes but from character codes. Thus the failure observed above occurs.

Font extraction in PdfClown 0.1.3 is in need of clean-up. It tries to make use of information from both the PDF objects and the embedded fonts (which is a good idea), but in some situations, like here, it shoots itself in the foot.
But it's still an early 0.x version after all, so some issues are to be expected...
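The key-space mismatch described above can be reproduced in isolation. The following is only a sketch under stated assumptions: the class and method names are hypothetical, the two sample entries are illustrative (which glyph index belongs to which character code inside the 32..213 -> 0..44 mapping is assumed), and plain Integer keys replace PdfClown's own types. The one hard fact used is that MacRomanEncoding maps character code 0xD5 (213) to U+2019.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the lookup chain; not PdfClown code.
class KeySpaceMismatchSketch {
    // As read from the embedded font: keyed by CHARACTER CODE (illustrative entries).
    static Map<Integer, Integer> glyphIndexesByCharCode() {
        Map<Integer, Integer> map = new HashMap<>();
        map.put(0x20, 0);  // space
        map.put(0xD5, 44); // right single quotation mark in MacRomanEncoding
        return map;
    }

    // As derived from MacRomanEncoding: character code -> Unicode.
    static Map<Integer, Integer> codesFromMacRoman() {
        Map<Integer, Integer> map = new HashMap<>();
        map.put(0x20, 0x0020);
        map.put(0xD5, 0x2019);
        return map;
    }

    // Mirrors the Widths loop: translate via codes first, then ask glyphIndexes.
    static Integer lookup(int charCode) {
        Integer unicode = codesFromMacRoman().get(charCode);
        return unicode == null ? null : glyphIndexesByCharCode().get(unicode);
    }

    public static void main(String[] args) {
        // Works by coincidence: code 0x20 and U+0020 are the same number.
        System.out.println(lookup(0x20)); // prints 0
        // Fails: glyphIndexes has key 0xD5, not 0x2019, so the result is null.
        System.out.println(lookup(0xD5)); // prints null
    }
}
```

For the ASCII range the character code and the Unicode value coincide, so the lookup succeeds by accident; only codes outside that range expose the problem, which fits the observation that the exception shows up for some documents only.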
Upvotes: 2