Reputation: 3833
This is probably a rather basic question, but I'm having a bit of trouble figuring it out, and it might be useful for future visitors.
I want to get at the raw data inside a PDF file, and I've managed to decode a page using the Python library PyPDF2
with the following commands:
import PyPDF2
with open('My PDF.pdf', 'rb') as infile:
mypdf = PyPDF2.PdfFileReader(infile)
raw_data = mypdf.getPage(1).getContents().getData()
print(raw_data)
Looking at the raw data provided, I have began to suspect that ASCII characters preceding carriage returns are significant: every carriage return that I've seen is preceded with one. It seems like they might be some kind of token identifier. I've already figured out that /RelativeColorimetric
is associated with the sequence ri\r
. I'm currently looking through the PDF 1.7 standard Adobe provides, and I know an explanation is in there somewhere, but I haven't been able to find it yet in that 756 page behemoth of a document
Upvotes: 1
Views: 464
Reputation: 22457
The defining thing here is not that \r
– it is just inserted instead of a regular space for readability – but the fact that ri
is an operator.
A PDF content stream uses a stack based Polish notation syntax: value1 value2 ... valuen operator
The full syntax of your ri
, for example, is explained in Table 57 on p.127:
intent ri (PDF 1.1) Set the colour rendering intent in the graphics state (see 8.6.5.8, "Rendering Intents").
and the idea is that this indeed appears in this order inside a content stream. (... I tried to find an appropriate example of your ri
in use but cannot find one; not even any in the ISO PDF itself that you referred to.)
A random stream snippet from elsewhere:
q
/CS0 cs
1 1 1 scn
1.5 i
/GS1 gs
0 -85.0500031 -14.7640076 0 287.0200043 344.026001 cm
BX
/Sh0 sh
EX
Q
(the indentation comes courtesy of my own PDF reader) shows operands (/CS0
, 1 1 1
, 1.5
etc.), with the operators (cs
, scn
, i
etc.) at the end of each line for clarity.
This is explained in 7.8.2 Content Streams:
...
A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical Conventions." It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See EXAMPLE 4 in 7.4, "Filters," for an example of a content stream.
(my emphasis)
7.2.2 Character Set specifies that inside a content stream, whitespace characters such as tab, newline, and carriage return, are just that: separators, and may occur anywhere and in any number (>= 1) between operands and operators. It mentions
NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
– to which I can add that most PDF creating software indeed attempts to delimit 'lines' consisting of an operands-operator sequence with returns.
Upvotes: 1