Reputation: 46493
I'd like to list all objects present in a PDF file: text blocks, images, fonts, page objects, but also vector shapes (if any).
I hoped to see all of them with PyMuPDF:
import fitz # pip install PyMuPDF
doc = fitz.open('test.pdf')
for xref in range(1, doc.xref_length()):
    print(doc.xref_object(xref))
but not everything is there. For example, text is not there. Text can be obtained separately with:
print(doc.load_page(0).get_text('dict'))
but I'm looking for a general method, rather than one specific to text elements, another for other objects, etc.
Question: how to print all objects present in a PDF file? (text blocks, images, vector shapes, etc.)
Notes:
I've already read How to extract text from a PDF file? and similar questions but this is specific to text, whereas I'm looking for all objects / attributes.
I've also read How to open PDF raw?, but it did not help here.
When opening a PDF with a text editor, we see a lot of human-unreadable binary data (it seems that it is not only for images).
TL;DR: I'm looking for a representation like:
Object0
TYPE:TEXT
CONTENT:lorem ipsum
POSITION:123,123
Object1
TYPE:IMAGE
...
Object2
TYPE:...
...
Upvotes: 4
Views: 4931
Reputation: 2841
from pdfminer.high_level import extract_pages
for page_layout in extract_pages("package-development.pdf"):
    for element in page_layout:
        print(element)
Here is an excerpt of the output:
<LTCurve 107.618,503.487,189.905,517.616>
<LTCurve 102.197,514.574,108.618,520.488>
<LTLine 742.906,690.178,1075.102,690.178>
<LTLine 185.379,36.023,1076.811,36.023>
<LTTextBoxHorizontal(0) 26.285,763.437,113.660,790.387 '" man/\n'>
<LTTextBoxHorizontal(1) 30.624,741.936,351.528,753.936 'The documentation will become the help pages in your package.\n'>
<LTTextBoxHorizontal(2) 29.272,711.703,329.175,726.553 '☑ Document each function with a roxygen block above its \n'>
<LTTextBoxHorizontal(3) 54.022,700.203,353.671,712.083 'definition. In RStudio, Code > Insert Roxygen Skeleton helps. \n'>
<LTTextBoxHorizontal(4) 29.272,674.803,351.319,689.653 '☑ Document each dataset with roxygen block above the name \n'>
<LTTextBoxHorizontal(5) 54.022,663.303,175.507,675.183 'of the dataset in quotes. \n'>
<LTTextBoxHorizontal(6) 29.272,637.903,299.154,652.753 '☑ Document the package with use_package_doc().\n'>
<LTTextBoxHorizontal(7) 384.050,765.279,450.186,779.279 'ROXYGEN2\n'>
<LTTextBoxHorizontal(8) 378.664,694.106,709.876,754.606 'The roxygen2 package lets you write documentation \ninline in your .R files with shorthand syntax. \n• Add roxygen documentation as comments beginning with #’. \n• Place a roxygen @ tag (right) after #’ to supply a specific section \n'>
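To get closer to the TYPE / CONTENT / POSITION listing asked for in the question, the layout elements can be turned into plain dictionaries. This is only a sketch: the helper names `describe`, `walk` and `describe_pdf` are made up for illustration, and it assumes that each layout element exposes a `bbox` tuple, that text elements expose `get_text()`, and that container elements such as figures are iterable.

```python
from collections.abc import Iterable


def describe(element):
    """Turn one layout element into a plain dict (hypothetical helper)."""
    record = {
        "type": type(element).__name__,
        # Some items may carry no bounding box, hence getattr with a default.
        "bbox": getattr(element, "bbox", None),
    }
    if hasattr(element, "get_text"):
        record["content"] = element.get_text().strip()
    return record


def walk(container):
    """Yield a record per element, descending into non-text containers."""
    for element in container:
        yield describe(element)
        # Text boxes are iterable too, but their text is already captured,
        # so only containers without get_text() are descended into.
        if isinstance(element, Iterable) and not hasattr(element, "get_text"):
            yield from walk(element)


def describe_pdf(path):
    """Yield records for every page of the PDF at `path`."""
    # Imported lazily so the helpers above can be tried without pdfminer.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(path):
        yield from walk(page_layout)
```

Calling `for record in describe_pdf("package-development.pdf"): print(record)` would then print one dictionary per element instead of the raw repr.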
Upvotes: 1
Reputation: 2841
Bear with me, please.
This isn't an answer but is really a complex comment in response to the overloaded use of the term "object" not only by the OP and commenters, but also by the PDF spec itself.
PDF has first-class support for booleans, integers, real numbers, strings, names, arrays, dictionaries, streams, and a singleton null object. But instead of describing the document as one giant dictionary, PDF allows defining objects with an object-id and referencing them later by that object-id. These are called indirect objects. The PDF document is actually just a bag of objects, with an index and a pointer to the "root" object at the tail of the file.
The objects in the PDF that have an object-id are what is typically meant by the informal use of the term "objects in a PDF". They describe the structure of the document and all the resources needed to produce it. However, these objects hold none of the actual content.
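As a rough illustration of the "bag of objects" idea, here is a deliberately naive sketch that scans raw PDF bytes for `N G obj ... endobj` spans. The function name `list_indirect_objects` and the sample bytes are made up for the example. It assumes an uncompressed, well-behaved file: objects packed into object streams are not found, and binary stream data that happens to contain these keywords would confuse it, so a real parser should follow the cross-reference index instead.

```python
import re

# "<object number> <generation number> obj ... endobj"
INDIRECT_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_indirect_objects(pdf_bytes):
    """Map (object number, generation) to the raw body of each object."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in INDIRECT_OBJECT.finditer(pdf_bytes)
    }


# Hand-written fragment, not a complete PDF file.
SAMPLE = b"""%PDF-1.1
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
trailer
<< /Root 1 0 R >>
"""

for (number, generation), body in list_indirect_objects(SAMPLE).items():
    print(number, generation, body)
```

Note how `2 0 R` inside object 1 is a reference to object 2 rather than a nested copy of it.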
Streams are used to hold a small postfix-based command language that is interpreted by the PDF viewer. Here is an example from https://brendanzagaeski.appspot.com/0004.html showing an actual valid snippet of PDF that shows an indirect object with object-id 4 and of type stream. My comments on the right.
4 0 obj             begin indirect object 4
<< /Length 55 >>    { 'Length': 55 }
stream              begin stream type
BT                  begin-text-object command
/F1 18 Tf           change-font to font with descriptor F1 at size 18pt
0 0 Td              position-text at x=0, y=0
(Hello World) Tj    render-text "Hello World"
ET                  end-text-object command
endstream           end stream type
endobj              end object
The PDF spec refers to all of the elements instantiated by commands inside of a stream as "graphics objects". Yes, even text objects are graphics objects. However, these objects aren't declared with properties; they are defined by instructions on how to build them, under an overarching state machine.
So here is the twist: if you want all the graphics objects in the following form:
{ 'content': [
{ 'type': 'text', 'position': [0,0], 'text': "Hello World" }
]}
you have to build an interpreter that keeps track of the graphics state and stores away the objects as they get created while the commands are executed. A basic PDF viewer doesn't have to do this, because its interpreter maps closely onto the graphics API and the graphics state held by the graphics layer.
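To make that concrete, here is a toy interpreter for the five operators used in the Hello World stream (`BT`, `Tf`, `Td`, `Tj`, `ET`). It is a sketch under heavy assumptions, not a usable PDF interpreter: the names `TOKEN` and `interpret` are made up, the tokenizer is naive, string escapes are kept raw, and the transformation matrix, text matrix (`Tm`) and every other operator are ignored. `Td` is simplified to a running offset.

```python
import re

# Naive tokenizer: literal strings, names, numbers, then operators.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"      # (literal string), no nested parentheses
    r"|/[^\s/()<>\[\]]+"         # /Name
    r"|[-+]?\d*\.?\d+"           # number
    r"|[A-Za-z'\"*]+"            # operator
)


def interpret(stream):
    """Run a content stream and collect the text objects it creates."""
    objects = []
    operands = []                # operands precede their operator (postfix)
    state = {"font": None, "size": None, "position": [0.0, 0.0]}
    in_text = False
    for tok in TOKEN.findall(stream):
        if tok.startswith("("):
            operands.append(tok[1:-1])       # escapes are left undecoded
        elif tok.startswith("/"):
            operands.append(tok[1:])
        elif tok[0].isdigit() or tok[0] in "+-.":
            operands.append(float(tok))
        else:                                # an operator: execute it
            if tok == "BT":
                in_text = True
                state["position"] = [0.0, 0.0]
            elif tok == "ET":
                in_text = False
            elif tok == "Tf":
                state["font"], state["size"] = operands[-2:]
            elif tok == "Td":
                tx, ty = operands[-2:]
                x, y = state["position"]
                state["position"] = [x + tx, y + ty]
            elif tok == "Tj" and in_text:
                objects.append({
                    "type": "text",
                    "position": list(state["position"]),
                    "text": operands[-1],
                    "font": state["font"],
                    "size": state["size"],
                })
            operands.clear()
    return objects


STREAM = "BT /F1 18 Tf 0 0 Td (Hello World) Tj ET"
print({"content": interpret(STREAM)})
```

The point of the design is that the position and font of "Hello World" never appear next to the string itself; they only exist in `state` at the moment `Tj` is executed, which is why a listing of declared objects cannot contain them.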
All images are taken from the PDF specification:
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
Upvotes: 6
Reputation: 164
You can try using pdfplumber:
import pdfplumber
with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.objects)
Read more at pdfplumber
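Since `first_page.objects` is a dictionary of lists keyed by object kind, it can be flattened into one listing similar to the one in the question. A minimal sketch, assuming each object is a dict with `x0` and `top` keys and that character objects carry a `text` key; the helper names `flatten_objects` and `describe_first_page` are hypothetical.

```python
def flatten_objects(objects):
    """Flatten a {kind: [object dicts]} mapping into one stream of records."""
    for kind, items in objects.items():
        for obj in items:
            record = {
                "type": kind,
                "position": (obj.get("x0"), obj.get("top")),
            }
            if "text" in obj:
                record["content"] = obj["text"]
            yield record


def describe_first_page(path):
    """Return flattened records for the first page of the PDF at `path`."""
    # Imported lazily so flatten_objects can be tried without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return list(flatten_objects(pdf.pages[0].objects))
```

Text comes back one character per record this way, so grouping characters into words or lines would still be a separate step.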
Upvotes: 1