I'd like to list all objects present in a PDF file: text blocks, images, fonts, page objects, and also vector shapes (if any). I hoped to see all of them with PyMuPDF:

import fitz  # pip install PyMuPDF

doc = fitz.open('test.pdf')
for xref in range(1, doc.xref_length()):
    print(doc.xref_object(xref))

but not everything is there. For example, text is not there. Text can be obtained separately with:

print(doc.load_page(0).get_text('dict'))

but I'm looking for a general method, rather than one method specific to text elements, another for other objects, etc.

Question: how to print all objects present in a PDF file (text blocks, images, vector shapes, etc.)?

Notes:

I've already read "How to extract text from a PDF file?" and similar questions, but they are specific to text, whereas I'm looking for all objects / attributes.
I've already read "How to open PDF raw?", but it did not help here.
When opening a PDF with a text editor, we see a lot of human-unreadable binary data (it seems that it is not only for images).

TL;DR: I'm looking for a representation like:

Object0 TYPE:TEXT CONTENT:lorem ipsum POSITION:123,123
Object1 TYPE:IMAGE ...
Object2 TYPE:...
...

Reputation: 2841

Extract elements using pdfminer.six

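from pdfminer.high_level import extract_pages  # pip install pdfminer.six

# Each page layout yields its top-level elements: text boxes, lines, curves, ...
for page_layout in extract_pages("package-development.pdf"):
    for element in page_layout:
        print(element)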

Here is an excerpt of the output:

<LTCurve 107.618,503.487,189.905,517.616>
<LTCurve 102.197,514.574,108.618,520.488>
<LTLine 742.906,690.178,1075.102,690.178>
<LTLine 185.379,36.023,1076.811,36.023>
<LTTextBoxHorizontal(0) 26.285,763.437,113.660,790.387 '" man/\n'>
<LTTextBoxHorizontal(1) 30.624,741.936,351.528,753.936 'The documentation will become the help pages in your package.\n'>
<LTTextBoxHorizontal(2) 29.272,711.703,329.175,726.553 '☑ Document each function with a roxygen block above its \n'>
<LTTextBoxHorizontal(3) 54.022,700.203,353.671,712.083 'definition. In RStudio, Code > Insert Roxygen Skeleton helps. \n'>
<LTTextBoxHorizontal(4) 29.272,674.803,351.319,689.653 '☑ Document each dataset with roxygen block above the name \n'>
<LTTextBoxHorizontal(5) 54.022,663.303,175.507,675.183 'of the dataset in quotes. \n'>
<LTTextBoxHorizontal(6) 29.272,637.903,299.154,652.753 '☑ Document the package with use_package_doc().\n'>
<LTTextBoxHorizontal(7) 384.050,765.279,450.186,779.279 'ROXYGEN2\n'>
<LTTextBoxHorizontal(8) 378.664,694.106,709.876,754.606 'The roxygen2 package lets you write documentation  \ninline in your .R files with shorthand syntax. \n• Add roxygen documentation as comments beginning with #’.  \n• Place a roxygen @ tag (right) after #’ to supply a specific section \n'>
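
To get closer to the TYPE / POSITION / CONTENT listing asked for in the question, the elements can be walked recursively. This is only a rough sketch: `walk` and `dump` are names I made up, text containers are treated as leaves so individual characters are not listed, and containers such as figures are assumed to be iterable like the page layout is.

```python
def walk(element, depth=0):
    """Yield (depth, type name, bbox, text) for an element and its children."""
    is_text = hasattr(element, "get_text")
    text = element.get_text().strip() if is_text else ""
    yield depth, type(element).__name__, getattr(element, "bbox", None), text
    if is_text:
        return  # do not descend into individual lines and characters
    try:
        children = iter(element)  # pages and figures are iterable containers
    except TypeError:
        return  # curves, lines, images and similar elements are leaves
    for child in children:
        yield from walk(child, depth + 1)


def dump(path):
    """Print one line per layout element found in the PDF at `path`."""
    # Imported here so walk() can be tried without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    for page in extract_pages(path):
        for depth, kind, bbox, text in walk(page):
            print("  " * depth + f"TYPE:{kind} POSITION:{bbox} CONTENT:{text!r}")


# Example call (requires pdfminer.six and the PDF file):
# dump("package-development.pdf")
```

Note that this lists what pdfminer.six's layout analysis reports, which is not the same thing as the raw objects stored in the file.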

Upvotes: 1

SargeATM

Reputation: 2841

Bear with me, please.

This isn't an answer but is really a complex comment in response to the overloaded use of the term "object" not only by the OP and commenters, but also by the PDF spec itself.

PDF is really just JSON on steroids

PDF has first-class support for booleans, integers, real numbers, strings, names, arrays, dictionaries, streams, and a singleton null object. But instead of describing the document as one giant dictionary, PDF allows defining objects with an object-id and referencing them later by that object-id. These are called indirect objects. The PDF document is actually just a bag of objects, with an index and a pointer to the "root" object at the tail of the file.
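
To make the "bag of objects" idea concrete, here is a toy model in Python. It is not a parser: the `Ref` class, the four objects, and the `resolve` helper are all invented for illustration, and real files carry many more keys. It only shows how following references from the root turns the bag back into one nested, JSON-like structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Stand-in for an indirect reference such as `2 0 R`."""
    obj_id: int


# Hypothetical document: object-id -> value built from PDF's basic types.
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
    3: {"Type": "Page", "Parent": Ref(2), "Contents": Ref(4)},
    4: "stream with drawing commands",
}
root = Ref(1)  # what the pointer at the tail of the file refers to


def resolve(value, objects, seen=frozenset()):
    """Inline references recursively, leaving back-references as Ref."""
    if isinstance(value, Ref):
        if value.obj_id in seen:
            return value  # e.g. Parent pointing back up the tree
        return resolve(objects[value.obj_id], objects, seen | {value.obj_id})
    if isinstance(value, dict):
        return {key: resolve(item, objects, seen) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, objects, seen) for item in value]
    return value


tree = resolve(root, objects)
print(tree)
```

The `seen` set is there because the object graph is not a tree: a page points back to its parent, so a naive expansion would never terminate.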

INDIRECT OBJECTS

These objects in the PDF that have an object-id are what is typically meant by the informal use of the term "objects" in a PDF. They are used to describe the structure of the document and all the resources that are needed to produce the document. However, apart from streams, these objects hold none of the actual page content.

STREAMS hold the content

Streams are used to hold a small postfix-based command language that is interpreted by the PDF viewer. Here is an example from https://brendanzagaeski.appspot.com/0004.html: a valid PDF snippet containing an indirect object with object-id 4 whose type is stream. My comments are on the right.

4 0 obj                 begin indirect object 4
  << /Length 55 >>      { 'Length': 55}
stream                  begin stream type
  BT                        begin-text-object command
    /F1 18 Tf               change-font to the font resource named F1 at size 18pt
    0 0 Td                  move the text position by the offset x=0, y=0
    (Hello World) Tj        render-text "Hello World"
  ET                        end-text-object command
endstream               end stream type
endobj                  end object
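
"Postfix" means the operands come first and the operator last. A minimal sketch of reading the stream body above into (operator, operands) pairs could look like this. The regular expression and the `parse_content_stream` name are my own simplifications: no escaped or nested parentheses, no arrays or dictionaries, and no compressed streams.

```python
import re

# In order: a (string), a /Name, a number, or an operator keyword.
TOKEN = re.compile(
    r"\((?P<str>[^)]*)\)"
    r"|(?P<name>/\w+)"
    r"|(?P<num>[-+]?\d*\.?\d+)"
    r"|(?P<op>[A-Za-z]+\*?)"
)


def parse_content_stream(data):
    """Return a list of (operator, operands) pairs from content-stream text."""
    instructions, operands = [], []
    for match in TOKEN.finditer(data):
        kind = match.lastgroup
        if kind == "op":
            # An operator consumes every operand collected since the last one.
            instructions.append((match.group("op"), operands))
            operands = []
        elif kind == "num":
            text = match.group("num")
            operands.append(float(text) if "." in text else int(text))
        else:
            operands.append(match.group(kind))
    return instructions


stream = "BT /F1 18 Tf 0 0 Td (Hello World) Tj ET"
for op, args in parse_content_stream(stream):
    print(op, args)
```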

GRAPHICS OBJECTS - the twist in the knickers

The PDF spec refers to all of the elements instantiated by commands inside of a stream as "graphics objects". Yes, even text objects are graphics objects. However, these objects aren't declared with properties; they are defined by instructions on how to build them, run against an overarching state machine (diagrammed in the PDF specification linked under References).

THE PAIN

So the twist, if you want all the graphics objects in the following form:

{ 'content': [
    { 'type': 'text', 'position': [0,0], 'text': "Hello World" }
]}

you have to build an interpreter to keep track of the graphics state and store away the objects as they get created when the commands are executed by the interpreter. A basic PDF viewer doesn't have to do this because its interpreter maps closely onto the graphics API, and the graphics layer itself holds the graphics state.
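
A heavily simplified sketch of such an interpreter, covering only the five operators from the Hello World stream, might look like the following. The `interpret` function and the shape of its `state` dictionary are assumptions of mine. A real interpreter also has to handle the text matrix, the current transformation matrix, save/restore of the graphics state, paths, images, and much more.

```python
def interpret(instructions):
    """Run (operator, operands) pairs and collect the text objects they create."""
    state = {"font": None, "size": None, "position": [0, 0]}  # tiny graphics state
    content = []
    in_text = False
    for op, args in instructions:
        if op == "BT":    # begin text object: reset the text position
            in_text = True
            state["position"] = [0, 0]
        elif op == "ET":  # end text object
            in_text = False
        elif op == "Tf":  # select font resource and size
            state["font"], state["size"] = args
        elif op == "Td":  # move the text position by an offset
            x, y = state["position"]
            state["position"] = [x + args[0], y + args[1]]
        elif op == "Tj" and in_text:
            # The object only exists once the command runs against the state.
            content.append({
                "type": "text",
                "position": list(state["position"]),
                "font": state["font"],
                "size": state["size"],
                "text": args[0],
            })
        # Every other operator is ignored in this sketch.
    return {"content": content}


# The Hello World stream, already split into (operator, operands) pairs.
program = [
    ("BT", []),
    ("Tf", ["/F1", 18]),
    ("Td", [0, 0]),
    ("Tj", ["Hello World"]),
    ("ET", []),
]
print(interpret(program))
```

The point is that position, font, and size are never written next to the text itself; they are whatever the state happens to be when `Tj` executes.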

So when you say objects...

Do you mean:

Indirect objects
The document catalog in JSON format
All the graphics objects
All of the above

References

All images came out of the PDF specification

https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf

Upvotes: 6

Print all objects inside a PDF file with Python
