Reputation: 97
I'm looking for a means to extract the position (x, y) and attributes (font / size) of every word in a document.
From the python-docx docs, I know that:
Conceptually, Word documents have two layers, a text layer and a drawing layer. In the text layer, text objects are flowed from left to right and from top to bottom, starting a new page when the prior one is filled. In the drawing layer, drawing objects, called shapes, are placed at arbitrary positions. These are sometimes referred to as floating shapes.
A picture is a shape that can appear in either the text or drawing layer. When it appears in the text layer it is called an inline shape, or more specifically, an inline picture.
[...] At the time of writing, python-docx only supports inline pictures.
Still, even if that is not its main purpose, I'm wondering whether something like this exists:
from docx import Document
main_file = Document("/tmp/file.docx")
for paragraph in main_file.paragraphs:
    for word in paragraph.text:  # <= Non-existing (yet wished) functionalities, IMHO
        print(word.x, word.y)  # <= Non-existing (yet wished) functionalities, IMHO
Does anybody have an idea? Best, Arthur
Upvotes: 3
Views: 3061
Reputation: 28903
for word in paragraph.text: # <= Non-existing (yet wished) functionalities, IMHO
This functionality is provided by Python itself as str.split(). The two can be composed easily as:
for word in paragraph.text.split():
    ...
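For the font and size part of the question, one option is to walk the runs of each paragraph instead of its plain text, since formatting is attached to runs in python-docx. The following is only a minimal sketch: `StyledWord` and `iter_styled_words` are hypothetical names, not part of python-docx, and it assumes that each word lies entirely inside one run, which is not guaranteed because Word can split a word across several runs. A font name or size of `None` means the value is inherited from the style rather than set on the run.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class StyledWord(NamedTuple):
    """Hypothetical container for one word and its run-level formatting."""
    text: str
    font_name: Optional[str]  # None when inherited from the style
    size_pt: Optional[float]  # None when inherited from the style


def iter_styled_words(paragraphs: Iterable) -> Iterator[StyledWord]:
    """Yield each whitespace-separated word with the font of its run.

    Works on any objects exposing .runs, run.text, run.font.name and
    run.font.size, which is the shape python-docx paragraphs have.
    """
    for paragraph in paragraphs:
        for run in paragraph.runs:
            size = run.font.size  # a Length in python-docx, or None
            size_pt = size.pt if size is not None else None
            for word in run.text.split():
                yield StyledWord(word, run.font.name, size_pt)


# Usage with python-docx (the path is a placeholder):
#   from docx import Document
#   for w in iter_styled_words(Document("/tmp/file.docx").paragraphs):
#       print(w.text, w.font_name, w.size_pt)
```

This gives attributes per word but still no coordinates, for the reason explained next.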
Regarding
print(word.x, word.y) # <= Non-existing (yet wished) functionalities, IMHO
I think it's safe to say this functionality will never appear in python-docx, and if it did it could not look like this.
What such a feature would be doing is asking the page renderer for the location at which the renderer was going to place those characters. python-docx has no rendering engine (because it does not render documents); it is simply a fancy XML editor that selectively modifies XML files in the WordprocessingML vocabulary.
It may be possible to get these values from Word itself, because Word does have a rendering engine (which it uses for screen display and printing).
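As a rough, untested illustration of that idea, Word's COM automation interface exposes `Range.Information`, which can report a range's position relative to the page. The sketch below assumes Windows, an installed copy of Word, and the pywin32 package; `iter_word_positions` and `dump_positions` are hypothetical helper names, and the two constants are the `WdInformation` values for horizontal and vertical position relative to the page. Positions are reported in points, and Word's notion of a "word" (the `Words` collection) includes punctuation and trailing spaces, so the output will not match `str.split()` exactly.

```python
from typing import Iterator, Tuple

# WdInformation enumeration values from the Word object model.
WD_HORIZONTAL_POSITION_RELATIVE_TO_PAGE = 5
WD_VERTICAL_POSITION_RELATIVE_TO_PAGE = 6


def iter_word_positions(com_document) -> Iterator[Tuple[str, float, float, str, float]]:
    """Yield (text, x, y, font name, font size) for each word range.

    `com_document` is expected to be a Word Document COM object.
    """
    for rng in com_document.Words:
        text = rng.Text.strip()
        if not text:  # skip paragraph marks and other whitespace-only ranges
            continue
        x = rng.Information(WD_HORIZONTAL_POSITION_RELATIVE_TO_PAGE)
        y = rng.Information(WD_VERTICAL_POSITION_RELATIVE_TO_PAGE)
        yield text, x, y, rng.Font.Name, rng.Font.Size


def dump_positions(path: str) -> None:
    """Open `path` in Word and print every word with its position."""
    import win32com.client  # pywin32, Windows only

    word = win32com.client.Dispatch("Word.Application")
    try:
        doc = word.Documents.Open(path)
        for item in iter_word_positions(doc):
            print(*item)
        doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

Calling `dump_positions` with an absolute path to a .docx file would print one line per word. Driving Word this way is slow on large documents, because each `Information` call goes through the layout engine.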
If there were such a function, I expect it would take a paragraph and a character offset within that paragraph, or something more along those lines, like document.position(paragraph, offset=42) or perhaps paragraph.position(offset=42).
Upvotes: 3