D.F. Stones

Reputation: 91

PDFBox: Invisible text from PdfTextStripper (not clip path or color issue)

File example: test

In the second row of the table, after "3500 RENT", PDFTextStripper returns two text tokens ("1", "1") that are not visible in the original PDF. I know that this can be caused by a clip path (like in the post here) or by a color issue (like in the post here).

However, in this case the text seems to be hidden by some other means: the clip path does not overlap those tokens, and their color is black.

What else could it be?

Upvotes: 0

Views: 392

Answers (1)

mkl

Reputation: 95918

It is a color issue: the '1's are printed in white.

What makes the situation a bit special is that the ColorSpace in use is not your off-the-shelf DeviceRGB or DeviceGray but a Separation color space, and color values in Separation color spaces are always treated as subtractive colors. Thus, a tint value of 0.0 denotes the lightest color that can be achieved with the given colorant, and 1.0 is the darkest. This convention is the same as for DeviceCMYK color components but opposite to the one for DeviceGray and DeviceRGB.

(cf. ISO 32000-1 section 8.6.6.4 "Separation Colour Spaces")

Inside view

Your content stream starts like this:

/Cs8 cs 1 scn

Cs8 is a Separation color space:

/Cs8 [/Separation /Black [/ICCBased 17 0 R] 18 0 R] 

with an ICCBased alternate space which in turn has DeviceRGB as alternate space

17 0 obj
<<
/Length 2597
/Alternate /DeviceRGB
/Filter /FlateDecode
/N 3
>>
stream
[...ICC profile...]
endstream
endobj 

and a tint transform by samples to the alternate color space

18 0 obj
<<
/Length 779
/BitsPerSample 8
/Decode [0 1 0 1 0 1]
/Domain [0 1]
/Encode [0 254]
/Filter /FlateDecode
/FunctionType 0
/Range [0 1 0 1 0 1]
/Size [255]
>>
stream
[...255 samples from (255,255,255) to (35,31,32)...]
endstream
endobj 
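To make the effect of this tint transform concrete, here is a minimal sketch that evaluates a sampled (FunctionType 0) function with the /Size and /Encode values of object 18. The real sample table sits in the compressed stream and is not reproduced here, so the sketch assumes a plain linear ramp from (255,255,255) to (35,31,32). The class and method names are invented for illustration, and the ICC profile of the alternate space is ignored.

```java
import java.util.Arrays;

class TintTransformSketch {
    static final int SIZE = 255;                    // /Size [255]
    static final double[] ENCODE = {0, 254};        // /Encode [0 254]
    static final int[] LIGHTEST = {255, 255, 255};  // first sample
    static final int[] DARKEST = {35, 31, 32};      // last sample

    // ASSUMPTION: a linear ramp stands in for the 255 samples of the real stream.
    static int[][] buildSamples() {
        int[][] samples = new int[SIZE][3];
        for (int i = 0; i < SIZE; i++) {
            double t = i / (double) (SIZE - 1);
            for (int c = 0; c < 3; c++) {
                samples[i][c] = (int) Math.round(LIGHTEST[c] + t * (DARKEST[c] - LIGHTEST[c]));
            }
        }
        return samples;
    }

    // Maps a tint in /Domain [0 1] to 8-bit RGB by linear interpolation between samples.
    static int[] toRgb(double tint, int[][] samples) {
        double x = Math.max(0.0, Math.min(1.0, tint));          // clip to the domain
        double e = ENCODE[0] + x * (ENCODE[1] - ENCODE[0]);     // position in the sample table
        int lo = (int) Math.floor(e);
        int hi = Math.min(lo + 1, SIZE - 1);
        double frac = e - lo;
        int[] rgb = new int[3];
        for (int c = 0; c < 3; c++) {
            rgb[c] = (int) Math.round(samples[lo][c] + frac * (samples[hi][c] - samples[lo][c]));
        }
        return rgb;
    }

    public static void main(String[] args) {
        int[][] samples = buildSamples();
        System.out.println(Arrays.toString(toRgb(0.0, samples))); // [255, 255, 255]
        System.out.println(Arrays.toString(toRgb(1.0, samples))); // [35, 31, 32]
    }
}
```

A tint of 0 therefore lands on the first sample (white) and a tint of 1 on the last one (near black), which is the subtractive convention quoted above.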

Your content stream continues with operations drawing the headers and the start of the first row and then

/TT2 1 Tf
0 scn
13.559 0 TD
6.8438 Tc
<00140014>Tj
1 scn 

0 scn sets the color to the lightest Cs8 Black separation color, which is mapped by sample to (255,255,255), i.e. white on screen. 6.8438 Tc sets a large character spacing (resulting in the gap between the two '1's), and <00140014>Tj draws the two '1's. Finally, 1 scn switches back to the darkest Cs8 Black separation color, which is mapped by sample to (35,31,32), a very dark gray on screen.
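The same reading of the stream can be sketched in code. The following toy scanner is not a PDF parser: it only understands hex strings, names, numbers and the handful of operators in the excerpt above, and it assumes that scn always takes a single numeric operand (true for a Separation space, not for patterns). All names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TintScanSketch {
    // Toy tokenizer: hex strings, names, numbers and operators only.
    static final Pattern TOKEN = Pattern.compile("<[0-9A-Fa-f]*>|/?[^\\s<>/]+");

    // Returns the operands of all Tj operations executed while the tint equals wantedTint.
    static List<String> textShownAtTint(String content, double wantedTint) {
        List<String> hits = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        double tint = Double.NaN;                 // unknown until the first scn
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            String tok = m.group();
            if (tok.equals("scn")) {
                // ASSUMPTION: single numeric operand, as in a Separation color space.
                tint = Double.parseDouble(operands.get(operands.size() - 1));
                operands.clear();
            } else if (tok.equals("Tj")) {
                if (tint == wantedTint) {
                    hits.add(operands.get(operands.size() - 1));
                }
                operands.clear();
            } else if (isOtherOperator(tok)) {
                operands.clear();                 // operator consumed its operands
            } else {
                operands.add(tok);                // operand: collect until an operator follows
            }
        }
        return hits;
    }

    // Only the operators that occur in the excerpt are recognized.
    static boolean isOtherOperator(String tok) {
        return tok.equals("cs") || tok.equals("Tf") || tok.equals("TD") || tok.equals("Tc");
    }

    public static void main(String[] args) {
        String content = "/Cs8 cs 1 scn\n/TT2 1 Tf\n0 scn\n13.559 0 TD\n6.8438 Tc\n<00140014>Tj\n1 scn";
        System.out.println(textShownAtTint(content, 0.0)); // [<00140014>]
    }
}
```

Run against the excerpt, the only string shown while the tint is 0 is <00140014>, i.e. the two invisible '1's.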

With PDFBox

In a comment you say

when I debug it in processTextPosition(TextPosition text), gs.getNonStrokingColor() has same value for those "1" tokens as for others tokens and is actually black

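By default, PDFTextStripper does not process color operators, so the non-stroking color in its graphics state never changes from the initial value. To recognize the color change with PDFBox, you have to tell its PDFTextStripper to look for the generic color space selection and color selection operators cs and scn, and extend processTextPosition like in this proof-of-concept: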

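import java.util.Arrays;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Note: the PDFTextStripper constructor throws IOException.
PDFTextStripper stripper = new PDFTextStripper() {
    @Override
    protected void processTextPosition(TextPosition text) {
        PDGraphicsState gs = getGraphicsState();
        PDColor color = gs.getNonStrokingColor();
        float[] currentComponents = color.getComponents();
        // Print the color components whenever they differ from the previous glyph's.
        if (!Arrays.equals(components, currentComponents)) {
            System.out.print(Arrays.toString(currentComponents));
            components = currentComponents;
        }
        System.out.print(text.getUnicode());
        super.processTextPosition(text);
    }

    // Components of the most recently seen non-stroking color.
    float[] components;
};

// Register the cs and scn operators (PDFBox 2.x no-argument constructors).
stripper.addOperator(new SetNonStrokingColorSpace());
stripper.addOperator(new SetNonStrokingColorN());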

(ExtractText test testTestSeparation)

With these settings in place you get

[1.0]TenantLeaseStart ... 3,500.00RENT[0.0]11[1.0]16,133.33

As you can see, the color component starts at 1.0, is 0.0 for the two '1's, and then returns to 1.0 until the next run of invisible '1's.
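Once the components are available, dropping the invisible text is a small step. The sketch below stands in for the processTextPosition override without depending on PDFBox: accept is a hypothetical method that takes the components and the Unicode text of one glyph. It assumes that a single component of 0.0 means white, which holds for a Separation space like Cs8 here but would be wrong for DeviceGray, where 0.0 is black; real code would also have to check the color space.

```java
class VisibleTextFilterSketch {
    private final StringBuilder visible = new StringBuilder();

    // Stand-in for processTextPosition: keeps the glyph unless it is drawn with tint 0.
    void accept(float[] components, String unicode) {
        // ASSUMPTION: one component with value 0.0 is the lightest tint of a Separation space.
        boolean lightestTint = components.length == 1 && components[0] == 0.0f;
        if (!lightestTint) {
            visible.append(unicode);
        }
    }

    String text() {
        return visible.toString();
    }

    public static void main(String[] args) {
        VisibleTextFilterSketch filter = new VisibleTextFilterSketch();
        filter.accept(new float[] {1.0f}, "RENT");
        filter.accept(new float[] {0.0f}, "1");
        filter.accept(new float[] {0.0f}, "1");
        filter.accept(new float[] {1.0f}, "16,133.33");
        System.out.println(filter.text()); // RENT16,133.33
    }
}
```

In the stripper itself, the equivalent would be to skip the call to super.processTextPosition(text) for such glyphs.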

Upvotes: 2
