Reputation: 25
I am reading text from a PDF using the PDFBox library and saving it to a text file. It also reads hidden text that is not visible when the PDF is viewed in a PDF reader. I need some characteristic of this hidden text that distinguishes it from normal text.
Upvotes: 2
Views: 1681
Reputation: 95918
One possible criterion for the text to ignore in your example files is the text color: pure CMYK white in one case, and 0.753 in a Gray Gamma 2.2 XYZ ICCBased color space in the other.
So let's extend the text stripper with a color filtering option. In particular, this means adding operator processors for the color-setting instructions, because PDFTextStripper ignores them by default:
import java.io.IOException;

import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PDFFilteringTextStripper extends PDFTextStripper {
    /** Decides whether a piece of text is kept in the extracted output. */
    public interface TextStripperFilter {
        public boolean accept(TextPosition text, PDGraphicsState graphicsState);
    }

    public PDFFilteringTextStripper(TextStripperFilter filter) throws IOException {
        // Register the color operators so the graphics state tracks the current colors.
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColorSpace());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColorN());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceGrayColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceGrayColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceRGBColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceRGBColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceCMYKColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceCMYKColor());
        this.filter = filter;
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // Forward only the text positions that the filter accepts.
        PDGraphicsState graphicsState = getGraphicsState();
        if (filter.accept(text, graphicsState))
            super.processTextPosition(text);
    }

    final TextStripperFilter filter;
}
(PDFFilteringTextStripper class)
Using that text stripper class, we can filter the white text from the first example PDF like this:
float[] colorToFilter = new float[] {0,0,0,0};
PDDocument document = ...;
PDFFilteringTextStripper stripper = new PDFFilteringTextStripper((text, gs) -> {
    PDColor color = gs.getNonStrokingColor();
    return color == null || !((color.getColorSpace() instanceof PDDeviceCMYK) && Arrays.equals(color.getComponents(), colorToFilter));
});
String text = stripper.getText(document);
(ExtractFilteredText test testExtractNoWhiteText...)
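Note that the filter above only matches white expressed in DeviceCMYK. White can also be written in other color spaces (for example DeviceGray 1 or DeviceRGB 1 1 1), and white text is only invisible on a white background. As a rough sketch, one could compare a packed RGB value against a threshold instead. The class name WhiteCheck, the method isNearWhite, and the threshold are hypothetical and not part of PDFBox; I believe PDFBox 2.x can supply such a packed value via PDColor.toRGB(), but treat that as an assumption to verify against your version.

```java
public class WhiteCheck {
    /**
     * Returns true if every channel of a packed 0xRRGGBB value
     * is at least minChannel (0..255). Hypothetical helper.
     */
    public static boolean isNearWhite(int packedRgb, int minChannel) {
        int r = (packedRgb >> 16) & 0xFF;
        int g = (packedRgb >> 8) & 0xFF;
        int b = packedRgb & 0xFF;
        return r >= minChannel && g >= minChannel && b >= minChannel;
    }

    public static void main(String[] args) {
        System.out.println(isNearWhite(0xFFFFFF, 250)); // pure white
        System.out.println(isNearWhite(0xC0C0C0, 250)); // light gray
    }
}
```

Inside the filter lambda, such a check could replace the color space and component comparison, at the cost of also dropping white text that is legitimately visible on a dark background.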
Similarly we can filter the gray text from the second example PDF like this:
float[] colorToFilter = new float[] {0.753f};
PDDocument document = ...;
PDFFilteringTextStripper stripper = new PDFFilteringTextStripper((text, gs) -> {
    PDColor color = gs.getNonStrokingColor();
    return color == null || !((color.getColorSpace() instanceof PDICCBased) && Arrays.equals(color.getComponents(), colorToFilter));
});
String text = stripper.getText(document);
(ExtractFilteredText test testExtractNoGrayText...)
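Both filters compare the color components with Arrays.equals, which only matches when every float is exactly equal. That works for these example files, but a value stored as, say, 0.7529 would slip through. A minimal sketch of a tolerance-based comparison follows; the class name ColorMatch, the method matches, and the tolerance value are hypothetical and chosen for illustration only.

```java
public class ColorMatch {
    /**
     * Returns true if both arrays have the same length and every
     * component differs from its target by at most the tolerance.
     */
    public static boolean matches(float[] components, float[] target, float tolerance) {
        if (components == null || target == null || components.length != target.length) {
            return false;
        }
        for (int i = 0; i < components.length; i++) {
            if (Math.abs(components[i] - target[i]) > tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

In the filter lambdas, the call Arrays.equals(color.getComponents(), colorToFilter) could then be swapped for something like ColorMatch.matches(color.getComponents(), colorToFilter, 0.001f).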
In a comment you asked:
A quick question- this text in 0.753 in a Gray Gamma 2.2 XYZ ICCBased colorspace - invisible text? Or is it just because of the colorspace, text is not visible in PDF?
It is visible! (Thus, strictly speaking, you should not remove it from the extracted text.)
It is merely quite small. On the title page, zoom in on the year "2016" to see it.
Upvotes: 1