Trying to use Python to parse a docx document in xml format to print words that are in bold

Question

I have a word docx file that I would like to print the words that are in Bold looking through the document in xml format it seems the words I'm looking to print have the following attribute.


  
     
     
     
  
  Print this Sentence

Specifically the w:rsidRPr="00510F21" attribute which specifies that the text is bold. Below is more of the XML document give a better idea of the structure.



   
   
   
       
       
   
 
 
    
       
       
    Some text
 
 
     
          
     
     For example
 
 
     
        
        
        
     
     
 
 
     
         
         
         
         
     
     Print this stuff

After doing some research and trying to do this with the Python-docx library I've decided to try using lxml. I was getting an error about the namespace and tried to add that namespace but it's returning an empty set. Below is some of the namespace stuff from the document.

Below is the code I'm using. Again I'd like to print if the attribute is w:rsidRPr="00510F21".

from lxml import etree
root = etree.parse("document.xml")

namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

wr_roots = root.findall('w:r', namespaces)
print wr_roots # prints empty set

for atype in wr_roots:
   if w:rsidRPr == '00510F21':
       print(atype.get('w:t'))

mhawke · Accepted Answer

If you want to find all the bold text you can use findall() with an xpath expression:

from lxml import etree

namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

root = etree.parse('document.xml').getroot()
for e in root.findall('.//w:r/w:rPr/w:b/../../w:t', namespaces):
    print(e.text)

Instead of looking for w:r nodes with w:rsidRPr="00510F21" as an attribute (which I am not convinced denotes bolded text), look for run nodes (w:r) with w:b in the run properties tag (w:rPr), and then access the text tag (w:t) within. The w:b tag is the bold property as documented here.

The xpath expression can be simplified to './/w:b/../../w:t' although this is less rigorous and might result in false matches.

Trying to use Python to parse a docx document in xml format to print words that are in bold

Answers (2)

Related Questions