I have a simple docx file with a PNG image inserted inline in the text, between the words "head" and "end".
I've tried:
>>> x=docx.Document('12.docx')
>>> for p in x.paragraphs:
...     print(p.text)
...
headend
>>> list(x.inline_shapes)
[]
When I unzip 12.docx, I find the image at word/media/image1.png. Is there a way to get output like this:
>>> for p in x.paragraphs:
...     print(p.text_with_image_info)
...
head<word/media/image1.png>end
You should be able to get a list of inline shapes like this:
>>> [s for s in x.inline_shapes]
[<InlineShape object at 0x...>]
If none show up, you'd probably need to examine the XML to find out why nothing is found at the XPath location '//w:p/w:r/w:drawing/wp:inline'. That might yield an interesting finding if you're seeing an empty list there.
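One way to examine the XML without opening it by hand is to count which picture containers the document actually uses. The sketch below uses only the standard library, not python-docx; the function names are made up for illustration. It rests on the assumption that the image may be stored as a floating wp:anchor or a legacy w:pict element, neither of which matches the XPath above.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"

# Elements that commonly wrap a picture in word/document.xml.
CONTAINERS = {
    f"{{{WP}}}inline": "wp:inline",   # inline picture
    f"{{{WP}}}anchor": "wp:anchor",   # floating picture
    f"{{{W}}}pict": "w:pict",         # legacy VML picture
}


def count_picture_containers(document_xml):
    """Count picture container elements in a document.xml string or bytes."""
    root = ET.fromstring(document_xml)
    return Counter(CONTAINERS[el.tag] for el in root.iter() if el.tag in CONTAINERS)


def count_in_docx(path):
    """Same count, reading word/document.xml straight from the .docx zip."""
    with zipfile.ZipFile(path) as zf:
        return count_picture_containers(zf.read("word/document.xml"))


# Hand-written sample: a single floating picture, no inline one.
SAMPLE = (
    f'<w:document xmlns:w="{W}" xmlns:wp="{WP}"><w:body><w:p><w:r>'
    "<w:drawing><wp:anchor/></w:drawing>"
    "</w:r></w:p></w:body></w:document>"
)
print(dict(count_picture_containers(SAMPLE)))  # {'wp:anchor': 1}
```

Calling count_in_docx('12.docx') on the real file would show whether any wp:inline elements exist at all. A nonzero wp:inline count alongside an empty inline_shapes list would suggest the element is nested somewhere the XPath does not reach.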
Regarding the bit about getting the text with image in document order, you'll need to go down to the lxml layer.
You can get the paragraph lxml element w:p using Paragraph._element (also available as Paragraph._p). From there you can inspect the XML with the .xml property:
>>> p = paragraph._p
>>> p.xml
'<w:p> etc ...'
You'll need to iterate through the children of the w:p element; I expect you'll find primarily w:r (run) elements. Text is held below those in w:t elements, and a w:drawing element is a peer of w:t, if I'm not mistaken.
You can construct python-docx objects like InlineShape with the right child element to get access to a more convenient API once you've located the right bit.
So it's a bit of work but doable if you're up to working with lxml-level calls.
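As a rough sketch of that traversal, the function below walks a w:p element in document order, collecting w:t text and emitting a marker for each a:blip (the element that carries the image's relationship id). The name text_with_image_info comes from the question. The targets mapping and the hand-written sample XML are assumptions made for illustration. The code sticks to calls shared by lxml and xml.etree.ElementTree, so it should accept paragraph._p as well, though that is untested here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
PIC = "http://schemas.openxmlformats.org/drawingml/2006/picture"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

T_TAG = f"{{{W}}}t"          # text inside a run
BLIP_TAG = f"{{{A}}}blip"    # picture reference inside a drawing
EMBED_ATTR = f"{{{R}}}embed"  # relationship id of the embedded image


def text_with_image_info(p, targets):
    """Return the paragraph text with <path> markers where images occur.

    p       -- a w:p element (ElementTree or lxml)
    targets -- hypothetical mapping of relationship id -> image path
    """
    parts = []
    for el in p.iter():  # depth-first, i.e. document order
        if el.tag == T_TAG:
            parts.append(el.text or "")
        elif el.tag == BLIP_TAG:
            rid = el.get(EMBED_ATTR)
            parts.append("<%s>" % targets.get(rid, rid))
    return "".join(parts)


# Simplified, hand-written paragraph: "head", an inline picture, "end".
SAMPLE_P = (
    f'<w:p xmlns:w="{W}" xmlns:wp="{WP}" xmlns:a="{A}" '
    f'xmlns:pic="{PIC}" xmlns:r="{R}">'
    "<w:r><w:t>head</w:t></w:r>"
    "<w:r><w:drawing><wp:inline><a:graphic><a:graphicData><pic:pic>"
    '<pic:blipFill><a:blip r:embed="rId4"/></pic:blipFill>'
    "</pic:pic></a:graphicData></a:graphic></wp:inline></w:drawing></w:r>"
    "<w:r><w:t>end</w:t></w:r>"
    "</w:p>"
)
p = ET.fromstring(SAMPLE_P)
print(text_with_image_info(p, {"rId4": "word/media/image1.png"}))
# head<word/media/image1.png>end
```

With python-docx, the targets mapping could probably be built from paragraph.part.related_parts, taking each related part's partname, but treat that as an assumption to verify against your installed version.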