D.F. Stones

Reputation: 91

PDFBox 2.0: Get color information in TextStripper

I'm using the PDFBox PDFTextStripper for text extraction. I also need the color information for each character, ideally in the writeString method. I found this solution for PDFBox 1.8 (which can easily be converted to 2.0), but I am also looking for the background color of each character (that answer only provides the character color). I added handlers for all fill operators: CloseFillNonZeroAndStrokePath, CloseFillEvenOddAndStrokePath, FillNonZeroAndStrokePath, FillEvenOddAndStrokePath, LegacyFillNonZeroRule, FillNonZeroRule, FillEvenOddRule (as suggested in this topic), and inside those operators I get the nonStrokingColor:

// Inner class of the custom PDFTextStripper subclass; linePath, fillColor and
// deleteCharsInPath() are members of that outer class.
public final class FillEvenOddRule extends OperatorProcessor {
    @Override
    public void process(Operator operator, List<COSBase> operands) throws IOException {
        linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
        deleteCharsInPath();
        linePath.reset();
        // Remember the color the path was filled with (the current non-stroking color).
        PDGraphicsState gs = getGraphicsState();
        PDColor nonStrokingColor = gs.getNonStrokingColor();
        fillColor = nonStrokingColor.toRGB();
    }

    @Override
    public String getName() {
        return "f*";
    }
}

Then in processTextPosition I tried to take this fillColor and put it into a map for each character, assuming the content stream works consecutively: after a fill operator completes, all characters subsequently arriving in processTextPosition should have this fillColor. However, this is not true, and all characters get the wrong color. In the file I'm trying to process, every second row has a blue fill, and I would like to get that blue color for each character in such a row and white for each character in a white row. Is this possible with PDFBox?

Upvotes: 0

Views: 1283

Answers (1)

mkl

Reputation: 96064

The problem in context with the sample document

Then in processTextPosition I tried to take this fillColor and put it into a map for each character, assuming the content stream works consecutively: after a fill operator completes, all characters subsequently arriving in processTextPosition should have this fillColor. However, this is not true, and all characters get the wrong color.

As you found out, your assumption is wrong for the PDF at hand. The strategy in this document is to first draw all background material and then draw all text. Thus, for this document your approach will always return the color of the last piece of background material drawn.

As mentioned in my comment to the second question you referenced, you have to collect all filled rectangles (or more generally: paths) in parallel to the actual text extraction. For each piece of text you inspect, you then determine the topmost path filled so far at the location of the text and check whether the text rendering color(s) coincide with the color of that path. Depending on the text rendering mode, the relevant text color may also be the StrokingColor!
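
The bookkeeping described above might look roughly like the following sketch. It deliberately uses only java.awt classes, not PDFBox; BackgroundTracker, recordFill and backgroundAt are hypothetical names, and the sketch assumes that paths and text positions have already been brought into the same coordinate system and that colors are available as RGB.

```java
import java.awt.Color;
import java.awt.Shape;
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a PDFBox class): remembers filled paths in paint order.
final class BackgroundTracker {
    private static final class Fill {
        final Shape shape;
        final Color color;

        Fill(Shape shape, Color color) {
            this.shape = shape;
            this.color = color;
        }
    }

    private final List<Fill> fills = new ArrayList<>();
    private final Color pageColor;

    // pageColor is what is reported where nothing has been painted (assumed, e.g. white).
    BackgroundTracker(Color pageColor) {
        this.pageColor = pageColor;
    }

    // To be called from every fill operator handler, before the current path is reset.
    void recordFill(Shape path, Color color) {
        // Copy the path, because the caller will usually reuse and reset its instance.
        fills.add(new Fill(new Path2D.Double(path), color));
    }

    // To be called from processTextPosition with a point inside the glyph.
    // Only fills recorded so far are considered, i.e. those painted before the text.
    Color backgroundAt(double x, double y) {
        for (int i = fills.size() - 1; i >= 0; i--) {
            Fill fill = fills.get(i);
            if (fill.shape.contains(x, y)) {
                return fill.color; // topmost fill covering the point
            }
        }
        return pageColor;
    }
}
```

In the stripper, each fill operator would call recordFill with linePath and the current non-stroking color, and processTextPosition would query backgroundAt for a point of the glyph, for example the center of its bounding box.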

In a comment you wonder

does this mean this approach will work for all documents?

Does this approach work for all documents

For many it does but not for all.

The following issues immediately come to mind:

  • Not all color spaces support the toRGB method you use. (I just checked, and I'm positively surprised by how many color spaces PDFBox does have an implementation for.)

    In particular in case of pattern colors you have to do a lot of digging into the pattern and its usage in your case to find the actual background color(s).

  • There are other ways to paint a background form, too, in particular:

    • The approach only considers filled paths, but with a large graphics state line width or a stretching transformation matrix, a stroked line can also paint rectangular forms. Thus, for this case you also have to consider stroked paths.

    • The background might be a bitmap image. In this case you'll have to analyze the image to get the background color(s).

    • Another alternative to consider is a shading fill. This usually will also result in a range of colors in the background.

  • Forms drawn over the glyph afterwards may, instead of simply covering it, change foreground and background considerably. There are, for example, blend modes that take the hue from the backdrop and the saturation from the foreground...

  • Soft masks active when drawing background or foreground may also be of interest.

  • ...
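
For the stroked-path case, one option is to convert the stroke into the outline of the area it paints, so it can be handled like any filled path. The following is only a sketch using java.awt.BasicStroke; StrokedArea is a hypothetical name, and the sketch assumes the line width has already been scaled by the current transformation matrix and ignores the cap and join styles of the graphics state (it uses butt caps and miter joins, the PDF defaults).

```java
import java.awt.BasicStroke;
import java.awt.Shape;
import java.awt.geom.Line2D;

// Hypothetical helper (not a PDFBox class).
final class StrokedArea {
    private StrokedArea() {
    }

    // Returns the outline of the area covered when path is stroked with lineWidth.
    static Shape of(Shape path, float lineWidth) {
        BasicStroke stroke =
                new BasicStroke(lineWidth, BasicStroke.CAP_BUTT, BasicStroke.JOIN_MITER);
        return stroke.createStrokedShape(path);
    }
}
```

A horizontal line stroked with a width of 10 yields a 10 unit high rectangle this way, which can then be stored together with the stroking color just like a filled path.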

Upvotes: 1
