Stefan

Reputation: 443

Python docx paragraph in textbox

Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?

I tried to find a keyword in all paragraphs in a document by iteration:

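from docx import Document

doc = Document('test.docx')

# doc.paragraphs only yields paragraphs that sit directly in the document body
for paragraph in doc.paragraphs:
    if '<DATE>' in paragraph.text:
        print('found date: ', paragraph.text)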

The keyword is found when it sits in normal body text, but not when it is inside a textbox.

Upvotes: 14

Views: 13514

Answers (2)

Stefan

Reputation: 443

A workaround for textboxes that contain only formatted text is to use a floating, formatted table instead. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible through the python-docx API.

from docx import Document

doc = Document('test.docx')

# walk every cell of every top-level table and check its paragraphs
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if '<DATE>' in paragraph.text:
                    print('found date: ', paragraph.text)
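
Once a paragraph is found, the placeholder still has to be replaced. Below is a minimal sketch, assuming that Word may have split the placeholder across several runs and that losing run-level formatting inside the affected paragraph is acceptable. `replace_placeholder` is a hypothetical helper, not part of python-docx; it only relies on each paragraph exposing `.runs` and each run having a writable `.text`, which python-docx paragraphs provide.

```python
def replace_placeholder(paragraphs, placeholder, value):
    """Replace placeholder in each paragraph; return how many paragraphs changed.

    Hypothetical helper: accepts any objects exposing `.runs`, where each run
    has a writable `.text` attribute (python-docx Paragraph objects do).
    """
    changed = 0
    for paragraph in paragraphs:
        runs = paragraph.runs
        full_text = "".join(run.text for run in runs)
        if not runs or placeholder not in full_text:
            continue
        # The placeholder may span several runs, so rewrite the paragraph as a
        # whole: the first run receives the new text, the others are emptied.
        runs[0].text = full_text.replace(placeholder, value)
        for run in runs[1:]:
            run.text = ""
        changed += 1
    return changed
```

Inside the cell loop above this could be called as `replace_placeholder(cell.paragraphs, '<DATE>', '2024-01-01')` (the date is only an illustrative value), followed by `doc.save(...)` to write the result.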

Upvotes: 8

scanny

Reputation: 28893

Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:

# doc.element.body is the lxml <w:body> element, which supports .xpath()
body = doc.element.body
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')

I have no idea whether textBox is the actual element name here; you'd have to sort that out along with the rest of the XPath details, but this approach will likely work. I frequently use similar approaches to work around features that aren't built into the API yet.
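
The XML-level idea can also be sketched with only the standard library, which makes it easy to try on a sample file. This is an illustration, not python-docx API. It assumes that the main part is stored at `word/document.xml` and that textbox paragraphs are wrapped in a `w:txbxContent` element, which is what Word-generated files appear to use; verify both against your own document. `textbox_paragraph_texts` and `textbox_texts_from_docx` are hypothetical helper names.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def textbox_paragraph_texts(document_xml):
    """Return the text of each w:p found inside a w:txbxContent element.

    Assumption: textbox content lives in w:txbxContent.
    """
    root = ET.fromstring(document_xml)
    texts = []
    for content in root.iter(f"{{{W_NS}}}txbxContent"):
        for p in content.iterfind(".//w:p", NS):
            # A paragraph's text is the concatenation of its w:t elements
            texts.append("".join(t.text or "" for t in p.iterfind(".//w:t", NS)))
    return texts


def textbox_texts_from_docx(path):
    """Read the main document part from a .docx package and scan it."""
    # Assumption: the main part is stored at word/document.xml
    with zipfile.ZipFile(path) as package:
        return textbox_paragraph_texts(package.read("word/document.xml"))
```

Word often stores a textbox twice, once inside `mc:Choice` and once inside `mc:Fallback`, so the same text may show up two times in the result. To modify the text rather than just read it, both copies would need to be changed.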

opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimal .docx file containing the type of thing you're trying to locate, then use opc-diag to inspect the XML Word generates when you save the file:

$ opc browse test.docx document.xml

http://opc-diag.readthedocs.org/en/latest/index.html

Upvotes: 9
