Reputation: 2492
QUESTION: How do I use the child object (below) to actually get the paragraph or table object?
This is based on the answer found here, which referenced docx Issue 40.
Unfortunately, none of the code posted there appears to work with commit e784a73, but I was able to get close by examining the code (and by trial and error).
I have the following ...
    import docx
    from docx.table import _Cell


    def iter_block_items(parent):
        """
        Yield each paragraph and table child within *parent*, in document order.
        Each returned value is an instance of either Table or Paragraph.
        """
        print(type(parent))
        if isinstance(parent, docx.document.Document):
            # Use the argument rather than the global `doc`
            parent_elm = parent.element.body
        elif isinstance(parent, _Cell):
            parent_elm = parent._tc
        else:
            raise ValueError("something's not right")
        for child in parent_elm.iter():
            if isinstance(child, docx.oxml.text.paragraph.CT_P):
                yield ("(paragraph)", child)
            elif isinstance(child, docx.oxml.table.CT_Tbl):
                yield ("(table)", child)

    for i in iter_block_items(doc):
        print(i)
This successfully iterates through the elements, and gives me the following output ...
    doc= <class 'docx.document.Document'>
    <class 'docx.document.Document'>
    ('(table)', <CT_Tbl '<w:tbl>' at 0x10c9ce0e8>)
    ('(paragraph)', <CT_P '<w:p>' at 0x10c9ceef8>)
    ('(paragraph)', <CT_P '<w:p>' at 0x10c9ce0e8>)
    ('(paragraph)', <CT_P '<w:p>' at 0x10c9cef98>)
    ('(paragraph)', <CT_P '<w:p>' at 0x10c9ce0e8>)
    ('(table)', <CT_Tbl '<w:tbl>' at 0x10c9ceef8>)
    ('(paragraph)', <CT_P '<w:p>' at 0x10c9cef48>)
    ('(paragraph)', <CT_P '<w:p>' at 0x10c9cef48>)
    ('(paragraph)', <CT_P '<w:p>' at 0x10c9cef98>)
All I need at this point is: the text from each paragraph, and the table object for the table - so I can iterate through its cells.
But child.text (for a paragraph) does not return the paragraph text (as it would in the example below), because the child object is not actually the paragraph object but an element object that should be able to 'get' it.
    for para in doc.paragraphs:
        print(para.text)
EDIT:
I've tried:

    yield child.text

(yields "None")

and

    from docx.text import paragraph
    yield paragraph(child)

(errors with TypeError: 'module' object is not callable)

and

    from docx.oxml.text import paragraph
    yield paragraph(child)

(errors with TypeError: 'module' object is not callable)
Upvotes: 2
Views: 3338
Reputation: 28893
You need to instantiate proxy objects for each element if you want the API properties and methods. That's where those live.
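    from docx.oxml.table import CT_Tbl
    from docx.oxml.text.paragraph import CT_P
    from docx.table import Table
    from docx.text.paragraph import Paragraph

    if isinstance(child, CT_P):
        yield Paragraph(child, parent)
    elif isinstance(child, CT_Tbl):
        yield Table(child, parent)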
This will generate Paragraph and Table objects. A Paragraph object has a .text property. For a table you'll need to dig down to cells.
What your code was yielding are the underlying XML element objects. These expose a low-level lxml interface (extended somewhat by the so-called oxml and/or xmlchemy layer), which is lower level than you probably want unless you're extending the behavior of a proxy object like Paragraph.
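To see why child.text came back as None, here is a toy illustration of the element-versus-proxy split. It uses the standard library's xml.etree.ElementTree rather than lxml, and ToyParagraph is a hypothetical stand-in written for this example, not python-docx's actual Paragraph implementation. On an element, .text means only the text that appears directly before its first child, and a w:p element keeps its text further down, inside w:t elements nested in runs.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by w:p, w:r and w:t.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE_XML = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>Hello, </w:t></w:r>"
    "<w:r><w:t>world</w:t></w:r>"
    "</w:p>"
)


class ToyParagraph:
    """Hypothetical proxy: wraps a w:p element and adds a friendlier API."""

    def __init__(self, element):
        self._p = element

    @property
    def text(self):
        # Concatenate the text of every w:t descendant, in document order.
        return "".join(t.text or "" for t in self._p.iter(f"{{{W}}}t"))


p_elm = ET.fromstring(SAMPLE_XML)
raw_text = p_elm.text                  # element-level .text: nothing before the first child
proxy_text = ToyParagraph(p_elm).text  # proxy-level .text: joined run text

print(repr(raw_text))
print(repr(proxy_text))
```

The real Paragraph proxy does more than this (it handles tabs and line breaks, for example), but the division of labor is the same: the element holds the XML, and the proxy holds the convenient properties.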
Upvotes: 2