chen vinson

Reputation: 21

How can I use Python to delete certain paragraphs in a .docx document?

I have a large .docx document with over 100 paragraphs. Some of them are junk paragraphs that I need to delete: specifically, the ones that contain the keyword "None". How can I use Python to delete those paragraphs? This is what I have so far, but it can only drop the blank paragraphs. How can I modify it to achieve my goal?

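import docx

f = docx.Document(r"test.docx")  # source document
doc = docx.Document()            # new, empty document to copy into

for para in f.paragraphs:
    # Skip paragraphs that are empty or consist only of newlines.
    if para.text.count("\n") == len(para.text):
        continue
    # Skip paragraphs whose first character is not a letter.
    if not para.text[0].isalpha():
        continue

    doc.add_paragraph(para.text)

doc.save(r"test2.docx")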

Upvotes: 2

Views: 5688

Answers (1)

abdul mutal

Reputation: 31

You should be able to do this for the simple case with this code:

def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    # Clear the references held by the Paragraph object, not by the XML element.
    paragraph._p = paragraph._element = None

Any subsequent access to the "deleted" paragraph object will raise AttributeError, so be careful not to keep a reference to it hanging around, including as an item in a list you stored earlier from Document.paragraphs.

The reason it's not in the library yet is that the general case is much trickier. In particular, it needs to detect and handle the variety of linked items that can be present in a paragraph, such as a picture, a hyperlink, or a chart.

But if you know for sure none of those are present, these few lines should get the job done.
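To tie this back to the question, here is a minimal sketch of how that helper could be applied to every paragraph containing the keyword "None". The names delete_paragraphs_containing and clean_file are hypothetical, made up for this example; they are not part of python-docx. It assumes the simple case described above (no pictures, hyperlinks, or charts in the affected paragraphs) and relies on the private _element attribute, which could change between library versions.

```python
def delete_paragraph(paragraph):
    """Detach the paragraph's XML element from its parent (simple case only)."""
    element = paragraph._element  # private python-docx attribute
    element.getparent().remove(element)
    # Drop the proxy's references so later use of this object fails loudly.
    paragraph._p = paragraph._element = None


def delete_paragraphs_containing(document, keyword):
    """Hypothetical helper: delete paragraphs whose text contains `keyword`.

    Returns the number of paragraphs removed.
    """
    # Collect the matches first, then delete, so nothing is removed mid-scan.
    matches = [p for p in document.paragraphs if keyword in p.text]
    for paragraph in matches:
        delete_paragraph(paragraph)
    return len(matches)


def clean_file(src_path, dst_path, keyword="None"):
    """Hypothetical wrapper: load a .docx, drop matching paragraphs, save a copy."""
    import docx  # python-docx; only needed when working with real files

    document = docx.Document(src_path)
    removed = delete_paragraphs_containing(document, keyword)
    document.save(dst_path)
    return removed
```

With the file names from the question, clean_file("test.docx", "test2.docx") would write the cleaned copy. Unlike copying text into a new document, this edits the loaded document in place, so the remaining paragraphs keep their formatting. Note that Document.paragraphs only covers top-level body paragraphs, so text inside tables is not examined, and the substring test also matches words such as "Nonetheless".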

Upvotes: 1
