Reputation: 91
File example: test
Here in the 2nd row of the table, after "3500 RENT", PDFTextStripper returns 2 text tokens ("1", "1") that are not actually visible in the original PDF.
I know that it could be a clip path (like in the post here) or a color issue (like in the post here).
However, it looks like in this case it's hidden by some other means... the clip path does not overlap and the color is black for those tokens.
What else could it be?
Upvotes: 0
Views: 392
Reputation: 95918
It is a color issue: the '1's are printed in white.
What makes the situation a bit special is that the ColorSpace in use is not your off-the-shelf DeviceRGB or DeviceGray but a Separation color space, and color values in Separation color spaces are always treated as subtractive colors. Thus, a tint value of 0.0 denotes the lightest color that can be achieved with the given colorant, and 1.0 is the darkest. This convention is the same as for DeviceCMYK color components but opposite to the one for DeviceGray and DeviceRGB.
(cf. ISO 32000-1 section 8.6.6.4 "Separation Colour Spaces")
Your content stream starts like this:
/Cs8 cs 1 scn
Cs8 is a Separation color space:
/Cs8 [/Separation /Black [/ICCBased 17 0 R] 18 0 R]
with an ICCBased alternate space which in turn has DeviceRGB as alternate space
17 0 obj
<<
/Length 2597
/Alternate /DeviceRGB
/Filter /FlateDecode
/N 3
>>
stream
[...ICC profile...]
endstream
endobj
and a tint transform by samples to the alternate color space
18 0 obj
<<
/Length 779
/BitsPerSample 8
/Decode [0 1 0 1 0 1]
/Domain [0 1]
/Encode [0 254]
/Filter /FlateDecode
/FunctionType 0
/Range [0 1 0 1 0 1]
/Size [255]
>>
stream
[...255 samples from (255,255,255) to (35,31,32)...]
endstream
endobj
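To see how a tint value ends up as a screen color, here is a minimal sketch of evaluating a sampled (type 0) function with the parameters of object 18. The actual 255 samples are elided in the dump above, so the sketch assumes a plain linear ramp between the two known end points; the class and method names are made up for illustration, and this is not how PDFBox implements it.

```java
import java.util.Arrays;

public class SeparationTintSketch {
    // Parameters copied from object 18 above.
    static final int SIZE = 255;           // /Size [255]
    static final float ENCODE_MAX = 254f;  // /Encode [0 254]

    // ASSUMPTION: the real sample table is not shown; only its end points are known.
    static final int[] FIRST = {255, 255, 255};
    static final int[] LAST = {35, 31, 32};

    // Hypothetical stand-in for the sample table: a linear ramp between the end points.
    static float sample(int index, int channel) {
        float t = index / (float) (SIZE - 1);
        return FIRST[channel] + t * (LAST[channel] - FIRST[channel]);
    }

    // Maps a tint in /Domain [0 1] to 8-bit RGB values.
    // With 8 bits per sample and /Decode [0 1], a sample s stands for s/255,
    // so the 8-bit sample can be used directly as the 8-bit output here.
    static int[] toRgb(float tint) {
        float clipped = Math.max(0f, Math.min(1f, tint)); // clip to the domain
        float pos = clipped * ENCODE_MAX;                 // encode into sample index space
        int lo = (int) Math.floor(pos);
        int hi = Math.min(lo + 1, SIZE - 1);
        float frac = pos - lo;
        int[] rgb = new int[3];
        for (int c = 0; c < 3; c++) {
            // Linear interpolation between the two neighbouring samples
            float v = sample(lo, c) + frac * (sample(hi, c) - sample(lo, c));
            rgb[c] = Math.round(v);
        }
        return rgb;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toRgb(0f))); // prints [255, 255, 255]
        System.out.println(Arrays.toString(toRgb(1f))); // prints [35, 31, 32]
    }
}
```

Tint 0 selects the first sample and tint 1 the last one, which is why 0 scn yields white and 1 scn the near-black (35,31,32).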
Your content stream continues with operations drawing the headers and the start of the first row and then
/TT2 1 Tf
0 scn
13.559 0 TD
6.8438 Tc
<00140014>Tj
1 scn
0 scn sets the color to the lightest Cs8 Black separation color, which is mapped by sample to (255,255,255) and so appears white on screen; 6.8438 Tc sets a large character spacing (resulting in the gap between the two '1's); <00140014>Tj draws the two '1's; and 1 scn switches back to the darkest Cs8 Black separation color, mapped by sample to (35,31,32), which is a very dark gray on screen.
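The bookkeeping described here can be sketched without any PDF library: walk the operators, remember the operand of the last scn, and report that tint whenever a Tj shows text. This is only a rough illustration with a made-up class name and a whitespace-splitting scanner that is far from a real PDF parser; the excerpt string joins the start of the content stream with the fragment above and leaves out the operations in between.

```java
import java.util.ArrayList;
import java.util.List;

public class TintTrackerSketch {
    // Returns one entry per Tj operator, in the form "<hex string> @ tint <value>".
    static List<String> shownTextWithTint(String content) {
        List<String> result = new ArrayList<>();
        String tint = "?";     // current non-stroking tint, unknown until the first scn
        String previous = "";  // the token seen just before the current one
        for (String token : content.trim().split("\\s+")) {
            if (token.equals("scn")) {
                // The operand right before scn is the new tint value
                tint = previous;
            } else if (token.endsWith("Tj")) {
                // The string operand may be glued to the operator, as in <00140014>Tj
                String operand = token.equals("Tj")
                        ? previous
                        : token.substring(0, token.length() - 2);
                result.add(operand + " @ tint " + tint);
            }
            previous = token;
        }
        return result;
    }

    public static void main(String[] args) {
        String excerpt = "/Cs8 cs 1 scn /TT2 1 Tf 0 scn 13.559 0 TD 6.8438 Tc <00140014>Tj 1 scn";
        System.out.println(shownTextWithTint(excerpt)); // prints [<00140014> @ tint 0]
    }
}
```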
In a comment you say
when I debug it in processTextPosition(TextPosition text), gs.getNonStrokingColor() has same value for those "1" tokens as for others tokens and is actually black
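That is because PDFTextStripper ignores color operators by default, so the graphics state keeps its initial non-stroking color, black, for all text. To recognize the color change with PDFBox, you have to tell its PDFTextStripper to look for the generic color space selection and color selection operators cs and scn, and extend processTextPosition like in this proof-of-concept: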
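// Imports assume the PDFBox 2.x package layout.
import java.util.Arrays;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

PDFTextStripper stripper = new PDFTextStripper() {
    // Components of the most recently printed non-stroking color
    float[] components;

    @Override
    protected void processTextPosition(TextPosition text) {
        PDGraphicsState gs = getGraphicsState();
        PDColor color = gs.getNonStrokingColor();
        float[] currentComponents = color.getComponents();
        // Print the color components whenever they change
        if (!Arrays.equals(components, currentComponents)) {
            System.out.print(Arrays.toString(currentComponents));
            components = currentComponents;
        }
        System.out.print(text.getUnicode());
        super.processTextPosition(text);
    }
};
// Make the stripper process the cs and scn operators
stripper.addOperator(new SetNonStrokingColorSpace());
stripper.addOperator(new SetNonStrokingColorN());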
(ExtractText test testTestSeparation)
With these settings in place you get
[1.0]TenantLeaseStart ... 3,500.00RENT[0.0]11[1.0]16,133.33
As you see, the color component starts with 1.0, for the two '1's it is 0.0, and thereafter it becomes 1.0 again until the next run of invisible '1's.
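If the goal is to get rid of the invisible tokens, one rough option is to post-process output tagged this way and drop every run that follows a [0.0] tag. The sketch below does that on the sample line shown above; the class and method names are hypothetical, it assumes the extracted text itself never contains such bracketed numbers, and it assumes that tint 0.0 means "invisible", which holds for this document's Separation space on a white background but not in general. In a real stripper the same decision would more likely be taken inside processTextPosition, by not calling super.processTextPosition for such text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaggedOutputFilterSketch {
    // Matches color tags such as [1.0] or [0.0] as printed by the proof-of-concept
    private static final Pattern TAG = Pattern.compile("\\[(\\d+(?:\\.\\d+)?)\\]");

    // Removes the tags and every text run that was drawn with tint 0.0
    static String dropTintZeroRuns(String tagged) {
        StringBuilder visible = new StringBuilder();
        Matcher m = TAG.matcher(tagged);
        boolean keep = true; // text before the first tag is kept
        int runStart = 0;
        while (m.find()) {
            if (keep) {
                visible.append(tagged, runStart, m.start());
            }
            // ASSUMPTION: tint 0.0 is the white, invisible case in this document
            keep = Float.parseFloat(m.group(1)) != 0f;
            runStart = m.end();
        }
        if (keep) {
            visible.append(tagged, runStart, tagged.length());
        }
        return visible.toString();
    }

    public static void main(String[] args) {
        String tagged = "[1.0]TenantLeaseStart ... 3,500.00RENT[0.0]11[1.0]16,133.33";
        System.out.println(dropTintZeroRuns(tagged));
        // prints TenantLeaseStart ... 3,500.00RENT16,133.33
    }
}
```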
Upvotes: 2