tcpiper

Reputation: 2544

python-docx: how to read text along with inline images?

I have a simple docx file with a PNG image inserted inline in the text, between the words "head" and "end".

I've tried:

>>> import docx
>>> x = docx.Document('12.docx')
>>> for p in x.paragraphs:
    print(p.text)


headend
>>> list(x.inline_shapes)
[]

After unzipping 12.docx, I found the image at word/media/image1.png. Is there a way to get output like this?

>>> for p in x.paragraphs:
    print(p.text_with_image_info)


head<word/media/image1.png>end

Upvotes: 1

Views: 3842

Answers (1)

scanny

Reputation: 28903

You should be able to get a list of inline shapes like this:

>>> [s for s in x.inline_shapes]
[<InlineShape object at 0x...>]

If the list comes back empty, as in your case, you'd probably need to examine the XML to find out why nothing matches the XPath expression '//w:p/w:r/w:drawing/wp:inline'. That might turn up something interesting.
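One way to do that inspection without opening the XML by hand is to count the elements that can hold a picture. This is only a sketch using the standard library: the function names are made up for illustration, and it assumes the picture is either an inline DrawingML picture (wp:inline), a floating one (wp:anchor), or a legacy VML picture (w:pict).

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace URIs for WordprocessingML and its drawing wrapper elements.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def count_picture_containers(document_xml):
    """Count elements that can hold a picture in word/document.xml.

    document_xml is the XML as str or bytes. (Hypothetical helper,
    not part of python-docx.)
    """
    root = ET.fromstring(document_xml)
    tags = {
        "wp:inline": f"{{{WP}}}inline",  # inline DrawingML picture
        "wp:anchor": f"{{{WP}}}anchor",  # floating DrawingML picture
        "w:pict": f"{{{W}}}pict",        # legacy VML picture
    }
    return {name: sum(1 for _ in root.iter(tag)) for name, tag in tags.items()}


def count_in_docx(path):
    """Read word/document.xml out of the .docx zip and count as above."""
    with zipfile.ZipFile(path) as z:
        return count_picture_containers(z.read("word/document.xml"))
```

Calling count_in_docx('12.docx') and seeing a zero for wp:inline but a non-zero count elsewhere would suggest the picture is stored as something other than an inline shape, which would be consistent with the empty inline_shapes list.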

As for getting the text and the image together in document order, you'll need to go down to the lxml layer.

You can get the paragraph's lxml element, w:p, using Paragraph._p (also available as Paragraph._element). From there you can inspect the XML with the .xml property:

>>> p = paragraph._p
>>> p.xml
'<w:p> etc ...'

You'll need to iterate through the children of the w:p element; I expect you'll find primarily w:r (run) elements. Text is held below those in w:t elements, and a w:drawing element is a peer of w:t, if I'm not mistaken.
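A rough sketch of that walk follows. It relies only on the element-tree interface (iter, tag, text, get), so it should work on the lxml element from paragraph._p as well as on a standard-library element. The function name, the rel_targets mapping, and the simplified sample XML are all assumptions made for illustration; a real w:drawing has more nesting around the a:blip element, and a linked (not embedded) picture would have no r:embed attribute.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for WordprocessingML, DrawingML and relationships.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def text_with_image_info(p, rel_targets):
    """Return paragraph text with <target> markers where pictures occur.

    p is a w:p element; rel_targets maps relationship ids such as
    'rId4' to image locations. (Hypothetical helper, not python-docx API.)
    """
    parts = []
    for el in p.iter():  # iter() visits descendants in document order
        if el.tag == f"{{{W}}}t":
            parts.append(el.text or "")
        elif el.tag == f"{{{A}}}blip":
            # r:embed holds the relationship id of the embedded image part
            r_id = el.get(f"{{{R}}}embed")
            parts.append(f"<{rel_targets.get(r_id, r_id)}>")
    return "".join(parts)


# Simplified stand-in for a paragraph with one inline picture.
SAMPLE = f"""<w:p xmlns:w="{W}" xmlns:a="{A}" xmlns:r="{R}">
  <w:r><w:t>head</w:t></w:r>
  <w:r><w:drawing><a:graphic><a:graphicData>
    <a:blip r:embed="rId4"/>
  </a:graphicData></a:graphic></w:drawing></w:r>
  <w:r><w:t>end</w:t></w:r>
</w:p>"""

print(text_with_image_info(ET.fromstring(SAMPLE), {"rId4": "word/media/image1.png"}))
# prints: head<word/media/image1.png>end
```

With python-docx, the mapping could probably be built from the document part's relationships, something like {r_id: rel.target_ref for r_id, rel in x.part.rels.items()}, though I'd check that against the installed version. Note that the target stored there is relative to the word/ folder, so it would read media/image1.png.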

You can construct python-docx objects like InlineShape with the right child element to get access to a more convenient API once you've located the right bit.

So it's a bit of work but doable if you're up to working with lxml-level calls.

Upvotes: 3
