user2743
user2743

Reputation: 1513

Trying to use Python to parse a docx document in xml format to print words that are in bold

I have a word docx file that I would like to print the words that are in Bold looking through the document in xml format it seems the words I'm looking to print have the following attribute.

<w:r w:rsidRPr="00510F21">
  <w:rPr><w:b/>
     <w:noProof/>
     <w:sz w:val="22"/>
     <w:szCs w:val="22"/>
  </w:rPr>
  <w:t>Print this Sentence</w:t>
</w:r>

Specifically the w:rsidRPr="00510F21" attribute which specifies that the text is bold. Below is more of the XML document give a better idea of the structure.

<w:p w14:paraId="64E19BC3" w14:textId="4D8C930F" w:rsidR="00FF6AD1" w:rsidRDefault="00FF6AD1" w:rsidP="00C11B48">
<w:pPr>
   <w:ind w:left="360" w:hanging="360"/>
   <w:jc w:val="both"/>
   <w:rPr>
       <w:sz w:val="22"/>
       <w:szCs w:val="22"/>
   </w:rPr>
 </w:pPr>
 <w:r>
    <w:rPr><w:b/>
       <w:noProof/><w:sz w:val="22"/>
       <w:szCs w:val="22"/>
    </w:rPr><w:t xml:space="preserve">Some text</w:t>
 </w:r>
 <w:r w:rsidRPr="0009466D">
     <w:rPr><w:i/><w:noProof/>
          <w:sz w:val="22"/><w:szCs w:val="22"/>
     </w:rPr>
     <w:t>For example</w:t>
 </w:r>
 <w:r>
     <w:rPr>
        <w:noProof/>
        <w:sz w:val="22"/>
        <w:szCs w:val="22"/>
     </w:rPr><w:t xml:space="preserve">
     </w:t>
 </w:r>
 <w:r w:rsidRPr="00510F21">
     <w:rPr>
         <w:b/>
         <w:noProof/>
         <w:sz w:val="22"/>
         <w:szCs w:val="22"/>
     </w:rPr>
     <w:t>Print this stuff</w:t>
 </w:r>

After doing some research and trying to do this with the Python-docx library I've decided to try using lxml. I was getting an error about the namespace and tried to add that namespace but it's returning an empty set. Below is some of the namespace stuff from the document.

<w:document
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" 
xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" 
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
xmlns:mv="urn:schemas-microsoft-com:mac:vml" 
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" 
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"  xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" 
xmlns:w10="urn:schemas-microsoft-com:office:word" 
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" 
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"            xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 w15 wp14">

Below is the code I'm using. Again I'd like to print if the attribute is w:rsidRPr="00510F21".

from lxml import etree
root = etree.parse("document.xml")

namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

wr_roots = root.findall('w:r', namespaces)
print wr_roots # prints empty set

for atype in wr_roots:
   if w:rsidRPr == '00510F21':
       print(atype.get('w:t'))

Upvotes: 5

Views: 13223

Answers (2)

mhawke
mhawke

Reputation: 87084

If you want to find all the bold text you can use findall() with an xpath expression:

from lxml import etree

namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

root = etree.parse('document.xml').getroot()
for e in root.findall('.//w:r/w:rPr/w:b/../../w:t', namespaces):
    print(e.text)

Instead of looking for w:r nodes with w:rsidRPr="00510F21" as an attribute (which I am not convinced denotes bolded text), look for run nodes (w:r) with w:b in the run properties tag (w:rPr), and then access the text tag (w:t) within. The w:b tag is the bold property as documented here.

The xpath expression can be simplified to './/w:b/../../w:t' although this is less rigorous and might result in false matches.

Upvotes: 7

Parfait
Parfait

Reputation: 107652

Consider lxml's xpath() method. Recall .get() retrieves attributes and .find() retrieves nodes. And because the XML has namespaces in attributes, you will need to prefix the URI in the .get() call. Finally, use the .nsmap object to retrieve all namespaces at the root of the document.

from lxml import etree
doc = etree.parse("document.xml")
root = doc.getroot()

for wr_roots in doc.xpath('//w:r', namespaces=root.nsmap):
    if wr_roots.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr')\
       == '00510F21':
        print(wr_roots.find('w:t', namespaces=root.nsmap).text)

# Print this stuff

Upvotes: 1

Related Questions