Reputation: 69
I am trying to parse the content stream of a PDF using PDFBox 2.0.0.
Here is the part of the code that handles it:
InputStream is;
try {
    is = this.input.getDocumentCatalog().getPages().get(page).getContents();
} catch (IOException e) {
    e.printStackTrace();
    return;
}
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
do {
    try {
        line = br.readLine();
    } catch (IOException e) {
        e.printStackTrace();
        try {
            br.close();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        return;
    }
    if (line != null) {
        System.out.println(line);
    }
} while (line != null);
The problem arises when I reach a "(someString) Tj" line. Here is an example of the output my code returns:
BT
/F2 7.0866 Tf
0 Tr
7.0866 TL
0.001 Tc
65 Tz
0 0 Td
(
ET
As you can see, the "(someString) Tj" line became "(" ...
In Eclipse's debug mode, when the program reaches this line, the "line" variable contains the following value:
"(
(with a " at the beginning and nothing after the '(', unlike any other string, which terminates with a second ").
If I expand the String value, I get the following array of chars:
[0] (
[1]
[2] %
[3]
[4] $
[5]
[6]
[7]
[8]
[9] )
[10]T
[11]j
Some of the empty cells show a "void" value (which raises a "Generated value (void) is not compatible with declared type (char)" error in Eclipse); others contain unintelligible characters. I think the problem comes from a wrong character encoding, but I can't find a solution.
I have already tried things like
line = new String(br.readLine().getBytes("UTF-8"), "UTF-8");
but since I'm not really sure what the problem is, it's hard to solve.
Can someone explain to me what the problem is and, if possible, how to solve it?
Thanks for your help.
Upvotes: 1
Views: 514
Reputation: 95928
Can someone explain to me what the problem is
The problem is that you try to treat the content stream as if it consists of pure textual data in some single standard encoding.
This is wrong.
While indeed the operators and numeric parameters are given in an ASCII'ish form, the content of string parameters of text showing operators may be encoded in ways that are completely unlike ASCII'ish data (let alone UTF-8-encoded ones).
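As a small illustration of why a text decoder cannot be trusted with such operands, the following sketch decodes some bytes with a charset and encodes them again. The class name and the sample bytes are made up for this example (they only loosely imitate the char array in the question). Java's String constructor replaces malformed byte sequences with a replacement character, so the UTF-8 round trip does not give the original bytes back, whereas ISO-8859-1 maps every byte to exactly one char and is therefore lossless.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StringOperandRoundTrip {

    /** Hypothetical bytes of a "(...) Tj" line whose operand is not text. */
    static byte[] sampleOperand() {
        return new byte[] {'(', 0x00, '%', 0x00, '$', (byte) 0x80, (byte) 0xFF, ')', ' ', 'T', 'j'};
    }

    /** Decodes the bytes with the given charset and encodes the result again. */
    static byte[] roundTrip(byte[] raw, Charset charset) {
        return new String(raw, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] raw = sampleOperand();
        // 0x80 and 0xFF are not valid UTF-8 on their own, so they are replaced
        // during decoding and the original values are lost.
        System.out.println("UTF-8 round trip intact:      "
                + Arrays.equals(raw, roundTrip(raw, StandardCharsets.UTF_8)));
        // ISO-8859-1 maps each of the 256 byte values to one char, so nothing is lost.
        System.out.println("ISO-8859-1 round trip intact: "
                + Arrays.equals(raw, roundTrip(raw, StandardCharsets.ISO_8859_1)));
    }
}
```

This is also why re-encoding the line afterwards, as in new String(br.readLine().getBytes("UTF-8"), "UTF-8"), cannot repair anything: the damage happens when the reader first turns the bytes into chars.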
To quote the specification:
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(section 9.4.3 Text-Showing Operators of ISO 32000-1)
If standard encodings are used, these font-specific encodings may resemble ASCII, Latin-1, or similar encodings, but especially in the case of partially embedded fonts you often find ad-hoc encodings without any relation to known encodings.
Thus, to properly parse content streams, you have to treat them as binary data and interpret the string operands according to the encoding of the current font at that very position in the content stream.
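To make this more concrete, here is a minimal sketch using only the JDK, not PDFBox. It reads a literal string operand byte by byte (honouring nested parentheses and the common backslash escapes) and then maps each byte through a code-to-Unicode table, as one would for a simple font. Everything named here is an assumption made for illustration: the class, the sample stream, and in particular the three-entry encoding table, which stands in for whatever the current font's encoding really defines. Composite fonts with multi-byte codes and backslash line continuations are not handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class SimpleFontStringDecoder {

    /** Returns the raw bytes of the literal string whose opening '(' is at index open. */
    static byte[] readLiteralString(byte[] stream, int open) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int depth = 1; // unescaped parentheses may nest as long as they are balanced
        int i = open + 1;
        while (i < stream.length) {
            int b = stream[i++] & 0xFF;
            if (b == '\\' && i < stream.length) {
                int e = stream[i++] & 0xFF;
                if (e >= '0' && e <= '7') {
                    // octal escape with one to three digits
                    int value = e - '0';
                    for (int n = 0; n < 2 && i < stream.length
                            && stream[i] >= '0' && stream[i] <= '7'; n++) {
                        value = value * 8 + (stream[i++] - '0');
                    }
                    out.write(value & 0xFF);
                } else {
                    switch (e) {
                        case 'n': out.write('\n'); break;
                        case 'r': out.write('\r'); break;
                        case 't': out.write('\t'); break;
                        case 'b': out.write('\b'); break;
                        case 'f': out.write('\f'); break;
                        default:  out.write(e); // \( \) \\ and unknown escapes
                    }
                }
            } else if (b == '(') {
                depth++;
                out.write(b);
            } else if (b == ')') {
                if (--depth == 0) {
                    break; // closing parenthesis of the operand itself
                }
                out.write(b);
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    /** Simple font: every byte is one character code, looked up in the font's table. */
    static String decode(byte[] codes, Map<Integer, String> codeToUnicode) {
        StringBuilder sb = new StringBuilder();
        for (byte code : codes) {
            sb.append(codeToUnicode.getOrDefault(code & 0xFF, "\uFFFD"));
        }
        return sb.toString();
    }

    /** Hypothetical content stream: two raw code bytes and one octal escape. */
    static byte[] sampleStream() {
        return "BT /F2 7.0866 Tf (\u0001\u0002\\003) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Hypothetical ad-hoc encoding, as a subset font might define it. */
    static Map<Integer, String> sampleEncoding() {
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "P");
        table.put(2, "D");
        table.put(3, "F");
        return table;
    }

    public static void main(String[] args) {
        byte[] stream = sampleStream();
        int open = 0;
        while (stream[open] != '(') {
            open++;
        }
        byte[] codes = readLiteralString(stream, open);
        System.out.println(decode(codes, sampleEncoding())); // prints "PDF"
    }
}
```

Note that the three code bytes 1, 2 and 3 would look like control characters or garbage if printed directly, which is roughly what the debugger showed in the question; only the font's table turns them into readable text.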
how to solve it
In PDFBox there are classes that already interpret content streams and try to find Unicode string representations for drawn text.
You may therefore want to look at:
- the PDFTextStripper class, which is the basic PDFBox text extraction class;
- the classes which present solutions for special text extraction problems, e.g. extraction of text from a given area on the page;
- the classes PDFTextStripper is derived from, which present a generic content stream parsing framework.

From a follow-up comment of the OP:
I chose this approach to extract the PDF's content because what I want to extract isn't text but vector-drawn schemas. The text I am trying to extract in this particular problem consists of the variables that are linked to specific parts of the schema. That's why I can't really use 'PDFTextStripper', since I need global information on the vectors that surround the text I extract. But maybe my approach is wrong from the beginning ...
To properly parse those texts, you do have to do something similar to what the text stripper does, and I would propose not to reinvent the wheel.
PDFTextStripper extends the class PDFTextStreamEngine, which in turn extends PDFStreamEngine.
PDFStreamEngine is a class which processes a PDF content stream and executes certain operations; it provides a callback interface for clients that want to do things with the stream.
PDFTextStreamEngine is the PDFStreamEngine subclass for advanced processing of text via TextPosition.
You might want to extend one of the latter two classes for your task and create and register callbacks for vector graphics operations. These callbacks can collect the vector graphics operations you need. The parallel callbacks for textual data provide the variables that are linked to specific parts of the schema.
Using these classes may introduce a certain amount of complexity and you will have to study them a bit, but as soon as you have understood their inner workings, they will quite likely turn out to be exactly the base you need.
Upvotes: 3