RightmireM

Reputation: 2492

Get docx element from doc.element.iter()

QUESTION: How do I use the child object (below) to actually get the paragraph or table object?

This is based on the answer found here, which referenced docx Issue 40.

Unfortunately, none of the code posted there appears to work with commit e784a73, but I was able to get close by examining the code (and by trial and error).

I have the following ...

import docx
from docx.table import _Cell

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph.
    """
    print(type(parent))
    if isinstance(parent, docx.document.Document):
        # Use the argument rather than the global `doc`.
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iter():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield ("(paragraph)", child)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            yield ("(table)", child)

for i in iter_block_items(doc): 
    print(i)

This successfully iterates through the elements, and gives me the following output ...

doc= <class 'docx.document.Document'>
<class 'docx.document.Document'>
('(table)', <CT_Tbl '<w:tbl>' at 0x10c9ce0e8>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9ceef8>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9ce0e8>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9cef98>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9ce0e8>)
('(table)', <CT_Tbl '<w:tbl>' at 0x10c9ceef8>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9cef48>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9cef48>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9cef98>)

All I need at this point is: the text from each paragraph, and the table object for the table - so I can iterate through its cells.

But child.text (for a paragraph) does not return the paragraph text (as it would in the example below), because the child object is not actually the paragraph object but an element object that should be able to 'get' it.

for para in doc.paragraphs:
    print(para.text)

EDIT:

I've tried:

yield child.text
(yields "None")

and

from docx.text import paragraph
yield paragraph(child)
(Errors with TypeError: 'module' object is not callable)

and

from docx.oxml.text import paragraph
yield paragraph(child)
(Errors with TypeError: 'module' object is not callable)

Upvotes: 2

Views: 3338

Answers (1)

scanny

Reputation: 28893

You need to instantiate proxy objects for each element if you want the API properties and methods. That's where those live.

if isinstance(child, CT_P):
    yield Paragraph(child, parent)
elif isinstance(child, CT_Tbl):
    yield Table(child, parent)

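This will generate Paragraph and Table objects, where Paragraph is the class in docx.text.paragraph and Table is the class in docx.table. (Your attempts imported the paragraph module and called the module itself rather than the Paragraph class, hence the TypeError.) A Paragraph object has a .text property. For a table you'll need to dig down to its cells, for example through table.rows and row.cells.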

What you were getting with your code is the underlying XML element objects. These expose a low-level lxml interface (elaborated a bit by the so-called oxml and/or xmlchemy layer), which is lower level than you probably want unless you're extending the behavior of a proxy object like Paragraph.

Upvotes: 2
