griff
griff

Reputation: 131

Pulling data out of MS Word with pywin32

I am running python 3.3 in Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week on the best method to do this. Originally I tried to save the .docx files as .txt and parse through using RE's, but I had some formatting problems with hidden characters - I was using a script to open a .docx and save as .txt. I am wondering if I did a proper File>SaveAs>.txt would it strip out the odd formatting and then I could properly parse through? I don't know but I gave up on this method.

I tried to use the docx module but I've been told it is not compatible with python 3.3. So I am left with using pywin32 and the COM. I have used this successfully with Excel to get the data I need but I am having trouble with Word because there is FAR less documentation and reading through the object model on Microsoft's website is over my head.

Here is what I have so far to open the document(s):

import win32com.client as win32
import glob, os

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)

So at this point I can do something like

print(doc.Content.Text) 

And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.

The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.

Edit: code I was using to save the files from docx to txt:

for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()

Upvotes: 4

Views: 7145

Answers (2)

Fred the Fantastic
Fred the Fantastic

Reputation: 1345

oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:

from oodocx import oodocx

d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')

If you just want to remove strings, you can pass an empty string to the "replace" parameter.

Upvotes: 0

abarnert
abarnert

Reputation: 365587

If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.

There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.

wdFormatUnicodeText = 7

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    txtpath = os.path.splitext('infile')[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)

As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)

Upvotes: 3

Related Questions