Reputation: 193
I have tested the code provided in this thread. It finds all text elements that lie within an image's bounding box. But how can you distinguish between text behind the image and text on top of the image?
Upvotes: 0
Views: 267
Reputation: 193
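Below is the code of the old answer mentioned above, ported to PDFBox 2.0.24. The main changes are:
- A getName() method was added.
- context.processSubStream was replaced with context.showForm.
- PDXObjectForm and PDXObjectImage were replaced with the new class names PDFormXObject and PDImageXObject.
- drawer.getResources().getXObjects() was replaced with drawer.getResources().getXObjectNames(), and the iteration over the XObjects is now based on the value returned by getXObjectNames().

import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.util.Matrix;

public final class CoveredText extends OperatorProcessor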
{
    @Override
    public void process(Operator operator, List<COSBase> operands) throws IOException
    {
        PDFVisibleTextStripper drawer = (PDFVisibleTextStripper) context;
        // Walk over every XObject in the current resources
        for (COSName objectName : drawer.getResources().getXObjectNames())
        {
            PDXObject xobject = drawer.getResources().getXObject(objectName);
            if (xobject == null)
            {
                System.out.println("CoveredText.process Can't find the XObject for '" + objectName.getName() + "'");
            }
            else if (xobject instanceof PDImageXObject)
            {
                // Images may cover text, so let the stripper hide what lies under them
                System.out.println("CoveredText.process " + objectName.getName() + " is a PDImageXObject");
                drawer.hide(objectName.getName());
            }
            else if (xobject instanceof PDFormXObject)
            {
                // Form XObjects carry their own content stream, so process it recursively
                PDFormXObject form = (PDFormXObject) xobject;
                System.out.println("CoveredText.process " + objectName.getName() + " is a PDFormXObject at location " + form.getBBox().toString());
                Matrix matrix = form.getMatrix();
                if (matrix != null)
                {
                    Matrix xobjectCTM = matrix.multiply(context.getGraphicsState().getCurrentTransformationMatrix());
                    context.getGraphicsState().setCurrentTransformationMatrix(xobjectCTM);
                }
                context.showForm(form);
            }
        }
    }

    @Override
    public String getName()
    {
        return "Do";
    }
}
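
As for telling text behind an image apart from text on top of it: one possible approach is to rely on paint order. For opaque content, whatever comes later in the content stream is painted over what came earlier, so only text shown before the image's Do operator can lie behind it. The following is a minimal sketch of that idea in plain Java. It is not PDFBox code: PaintOrderSketch, TextRun, showText, drawImage and visibleText are hypothetical names used only for illustration, and it ignores clipping, transparency, image masks and text rendering modes.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model (not part of PDFBox): uses paint order to find covered text. */
final class PaintOrderSketch {

    /** A piece of text with its bounding box and a "covered" flag. */
    static final class TextRun {
        final String text;
        final Rectangle2D box;
        boolean covered;

        TextRun(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    // Text runs in the order in which they were painted
    private final List<TextRun> runs = new ArrayList<>();

    /** Would be called when a text-showing operator is processed. */
    void showText(String text, Rectangle2D box) {
        runs.add(new TextRun(text, box));
    }

    /**
     * Would be called when an image is painted. Only text recorded so far
     * can be behind the image; text shown later ends up on top of it.
     */
    void drawImage(Rectangle2D imageBox) {
        for (TextRun run : runs) {
            if (imageBox.contains(run.box)) {
                run.covered = true;
            }
        }
    }

    /** Returns the text that no later image has fully covered. */
    List<String> visibleText() {
        List<String> result = new ArrayList<>();
        for (TextRun run : runs) {
            if (!run.covered) {
                result.add(run.text);
            }
        }
        return result;
    }
}
```

In a real stripper, showText would be fed from the text callbacks and drawImage from the Do operator handler, both with boxes in the same coordinate space. Text that falls inside the image box but is shown after the image stays visible, which is exactly the "text above the image" case.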
Upvotes: 1