Graham

Reputation: 3833

What do the ASCII characters preceding a carriage return represent in a PDF page?

This is probably a rather basic question, but I'm having a bit of trouble figuring it out, and it might be useful for future visitors.

I want to get at the raw data inside a PDF file, and I've managed to decode a page using the Python library PyPDF2 with the following commands:

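import PyPDF2

# Open in binary mode: a PDF is not a plain-text file.
with open('My PDF.pdf', 'rb') as infile:
    mypdf = PyPDF2.PdfFileReader(infile)
    # getPage() is zero-based, so index 1 is the second page.
    raw_data = mypdf.getPage(1).getContents().getData()
    print(raw_data)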

Looking at the raw data provided, I have begun to suspect that the ASCII characters preceding carriage returns are significant: every carriage return that I've seen is preceded by some. It seems like they might be some kind of token identifier. I've already figured out that /RelativeColorimetric is associated with the sequence ri\r. I'm currently looking through the PDF 1.7 standard Adobe provides, and I know an explanation is in there somewhere, but I haven't been able to find it yet in that 756-page behemoth of a document.

Upvotes: 1

Views: 464

Answers (1)

Jongware

Reputation: 22457

The defining thing here is not that \r – it is just inserted instead of a regular space for readability – but the fact that ri is an operator.

A PDF content stream uses a stack-based reverse Polish (postfix) notation syntax, in which the operands come first and the operator last: value1 value2 ... valuen operator
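To make the operands-then-operator order concrete, here is a minimal sketch in Python that groups the tokens of a decoded stream into (operands, operator) pairs. It is only an illustration under loud assumptions: the helper names is_operand and group_instructions are made up for this answer, only names and plain numbers are recognised as operands, and strings, arrays, dictionaries and inline images are not handled at all.

```python
import re

# Assumption: a PDF number looks like 1, -85.05, +3 or .5 (no exponents).
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")

def is_operand(token):
    """Names (/CS0) and numbers are operands; anything else is treated as an operator."""
    return token.startswith(b"/") or NUMBER.fullmatch(token) is not None

def group_instructions(stream):
    """Collect operands until an operator arrives, then emit (operands, operator)."""
    instructions = []
    pending = []  # operands still waiting for their operator
    for token in stream.split():  # no-argument split() breaks on any run of ASCII whitespace
        if is_operand(token):
            pending.append(token)
        else:
            instructions.append((pending, token))
            pending = []
    return instructions

# Hypothetical sample bytes, written in the style of the question's ri\r sequence.
for operands, operator in group_instructions(b"/RelativeColorimetric ri\r1 1 1 scn\r1.5 i\r"):
    print(operator, operands)
```

Each printed line pairs one operator with the operands that preceded it, which is the same grouping a PDF viewer has to perform before it can act on the stream.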

The full syntax of your ri, for example, is explained in Table 57 on p.127:

intent ri (PDF 1.1) Set the colour rendering intent in the graphics state (see 8.6.5.8, "Rendering Intents").

and the idea is that this indeed appears in this order inside a content stream. (... I tried to find an appropriate example of your ri in use but cannot find one; not even any in the ISO PDF itself that you referred to.)

A random stream snippet from elsewhere:

q
  /CS0 cs
  1 1 1 scn
  1.5 i
  /GS1 gs
  0 -85.0500031 -14.7640076 0 287.0200043 344.026001 cm
  BX
  /Sh0 sh
  EX
Q

(the indentation comes courtesy of my own PDF reader) shows operands (/CS0, 1 1 1, 1.5 etc.), with the operators (cs, scn, i etc.) at the end of each line for clarity.

This is explained in 7.8.2 Content Streams:

...
A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical Conventions." It consists of PDF objects denoting operands and operators. The operands needed by an operator shall precede it in the stream. See EXAMPLE 4 in 7.4, "Filters," for an example of a content stream.
(my emphasis)

7.2.2 Character Set specifies that inside a content stream, whitespace characters such as tab, line feed, and carriage return are just that: separators. They may occur anywhere and in any number (>= 1) between operands and operators. It mentions
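A small sketch of what that means in practice: splitting on any run of the six white-space characters listed in 7.2.2 (NUL, tab, line feed, form feed, carriage return, space) gives the same tokens no matter which separator the producing software chose. The function name tokens and the sample byte strings are made up for illustration, and the splitter deliberately ignores strings and comments, where whitespace is not a mere separator.

```python
import re

# The six white-space characters from 7.2.2: NUL, tab, LF, FF, CR, space.
PDF_WHITESPACE = re.compile(rb"[\x00\t\n\x0c\r ]+")

def tokens(stream):
    """Split a decoded content stream on runs of PDF white space."""
    return [t for t in PDF_WHITESPACE.split(stream) if t]

with_cr = tokens(b"/RelativeColorimetric ri\r")
with_mixed = tokens(b"/RelativeColorimetric\n\t ri ")

# Both spellings yield the same two tokens: the \r carries no meaning of its own.
print(with_cr == with_mixed, with_cr)
```

So the \r after ri in the question's data is interchangeable with a space or a line feed; the token that matters is ri itself.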

NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.

– to which I can add that most PDF creating software indeed attempts to delimit 'lines' consisting of an operands-operator sequence with returns.

Upvotes: 1
