zdd

Reputation: 8757

Extract headings from a MS Word document in Python

I have an MS Word document that contains some text and headings, and I want to extract the headings. I installed Python for Win32 (pywin32), but I don't know which methods to use; the pywin32 help file does not seem to list the functions of the Word object. Take the following code as an example:

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument

How can I find out all the functions of the Word object? I didn't find anything useful in the help document.

Upvotes: 5

Views: 10023

Answers (3)

Pankaj Singh

Reputation: 1178

Convert the Word document to .docx and use the python-docx module:

from docx import Document

file = 'test.docx'
document = Document(file)

for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
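
The loop above only matches 'Heading 1'. As a rough sketch of how the same idea could cover every heading level, the helper names below (heading_level, extract_headings) are made up for illustration, and the code assumes the document uses the built-in styles, which python-docx reports by their English names such as 'Heading 2':

```python
import re

# Built-in heading styles are named "Heading 1", "Heading 2", ...
_HEADING_RE = re.compile(r"^Heading (\d+)$")


def heading_level(style_name):
    """Return the numeric level for a 'Heading N' style name, else None."""
    match = _HEADING_RE.match(style_name or "")
    return int(match.group(1)) if match else None


def extract_headings(path):
    """Return (level, text) pairs for every heading paragraph in a .docx file.

    Hypothetical helper; requires the third-party python-docx package.
    """
    from docx import Document  # imported lazily so heading_level works without it

    headings = []
    for paragraph in Document(path).paragraphs:
        level = heading_level(paragraph.style.name)
        if level is not None:
            headings.append((level, paragraph.text))
    return headings
```

Calling extract_headings('test.docx') should then give a list of (level, text) tuples in document order. Custom heading styles with other names would not be picked up by this pattern.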

Upvotes: 4

RocketDonkey

Reputation: 37279

The Word object model can be found here. Your doc object will contain these properties, and you can use them to perform your desired actions (note that I haven't used this feature with Word, so my knowledge of the object model is sparse). For instance, if you wanted to read all the words in a document, you could do:

# Loop variable is `w` so it doesn't shadow the `word` application object
for w in doc.Words:
    print(w)

And you would get all of the words. Each of those word items would be a Word object (reference here), so you could access those properties during iteration. In your case, here is how you would get the style:

for w in doc.Words:
    print(w.Style)

On a sample doc with a single Heading 1 and normal text, this prints:

Heading 1
Heading 1
Heading 1
Heading 1
Heading 1
Normal
Normal
Normal
Normal
Normal
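
As for discovering which members the doc object has without the reference pages: one option that may help is early binding through pywin32's gencache, which generates Python wrapper classes from Word's type library. The sketch below is only an illustration; member_names and list_word_document_members are hypothetical helpers, and the approach relies on the generated classes keeping their COM properties in the _prop_map_get_ and _prop_map_put_ dictionaries.

```python
def member_names(obj):
    """Return the sorted public member names of obj.

    Besides dir(), this looks at _prop_map_get_ / _prop_map_put_, the
    dictionaries where makepy-generated wrapper classes store COM properties.
    """
    names = {name for name in dir(obj) if not name.startswith("_")}
    for map_name in ("_prop_map_get_", "_prop_map_put_"):
        names.update(getattr(obj, map_name, {}))
    return sorted(names)


def list_word_document_members(path):
    """Open a document in Word and list the members of the Document object.

    Hypothetical helper; Windows only, needs pywin32 and an installed Word.
    """
    from win32com.client import gencache  # imported lazily: Windows only

    # EnsureDispatch builds (or reuses) the generated wrappers for Word
    word = gencache.EnsureDispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(path)
        return member_names(doc)
    finally:
        word.Quit()
```

With a late-bound object from plain Dispatch, dir() shows very little, which is probably why the members seem hidden; the generated wrappers make them visible to Python.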

To group the headings together, you can use itertools.groupby. As explained in the code comments below, you need to reference the str() of the object itself, as using word.Style returns an instance that won't properly group with other instances of the same style:

from itertools import groupby
import win32com.client as win32

# All the same as yours
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("testdoc.doc")
doc = word.ActiveDocument

# Here we use itertools.groupby (without sorting anything) to
# find groups of words that share the same heading (note it picks
# up newlines). The tricky/confusing thing here is that you can't
# just group on the Style itself - you have to group on the str(). 
# There was some other interesting behavior, but I have zero 
# experience with COMObjects so I'll leave it there :)
# All of these comments for two lines of code :)
for heading, grp_wrds in groupby(doc.Words, key=lambda x: str(x.Style)):
    print(heading, ''.join(str(w) for w in grp_wrds))

This outputs:

Heading 1 Here is some text

Normal 
No header

If you replace the join with a list comprehension, you get the below (where you can see the newlines):

Heading 1 ['Here ', 'is ', 'some ', 'text', '\r']
Normal ['\r', 'No ', 'header', '\r', '\r']
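
Since the goal is headings rather than individual words, iterating over doc.Paragraphs may be simpler than grouping words. The following is a sketch under assumptions: it uses each paragraph's OutlineLevel (where 10 is wdOutlineLevelBodyText, meaning "not a heading") rather than the style name, and the function names are invented for this example.

```python
WD_OUTLINE_LEVEL_BODY_TEXT = 10  # value of Word's wdOutlineLevelBodyText constant


def headings_from_paragraphs(paragraphs):
    """Filter (outline_level, text) pairs down to non-empty headings.

    Word paragraph text ends with '\r', so the text is stripped.
    """
    return [
        (level, text.strip())
        for level, text in paragraphs
        if level != WD_OUTLINE_LEVEL_BODY_TEXT and text.strip()
    ]


def read_word_headings(path):
    """Return (outline_level, text) for each heading in a Word document.

    Hypothetical helper; Windows only, needs pywin32 and an installed Word.
    """
    from win32com.client import Dispatch  # imported lazily: Windows only

    word = Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(path)
        pairs = [(p.OutlineLevel, p.Range.Text) for p in doc.Paragraphs]
        doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
    return headings_from_paragraphs(pairs)
```

Passing an absolute path to read_word_headings tends to be safer, since Word does not necessarily resolve relative paths against Python's working directory.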

Upvotes: 4

Ali Afshar

Reputation: 41667

You can also use the Google Drive SDK to convert the Word document to something more useful, like HTML, where you can easily extract the headers.

https://developers.google.com/drive/manage-uploads

Upvotes: 2
