Reputation: 8757
I have an MS Word document containing some text and headings, and I want to extract the headings. I installed Python for Win32, but I don't know which method to use; the Python for Windows help doesn't seem to list the functions of the Word object. Take the following code as an example:
import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument
How can I find out all the functions of the Word object? I didn't find anything useful in the help documentation.
Upvotes: 5
Views: 10023
Reputation: 1178
Convert the Word file to .docx and use the python-docx module:
from docx import Document

file = 'test.docx'
document = Document(file)
for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
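If installing python-docx is not an option, the same idea can be sketched with the standard library alone, since a .docx file is a zip archive whose main text lives in word/document.xml. The sketch below assumes the headings use style IDs that start with "Heading" (the style ID is "Heading1" where the style name is "Heading 1"), and `extract_headings` and `make_sample_docx` are hypothetical helper names; the sample document is built in memory purely for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_headings(docx_file):
    """Return (style_id, text) for paragraphs whose style ID starts with 'Heading'."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    headings = []
    for para in root.iter("{%s}p" % W):
        # The paragraph style, if any, is stored in w:pPr/w:pStyle/@w:val
        style = para.find("{%s}pPr/{%s}pStyle" % (W, W))
        if style is None:
            continue
        style_id = style.get("{%s}val" % W, "")
        if style_id.startswith("Heading"):
            # A paragraph's text is split across one or more w:t elements
            text = "".join(t.text or "" for t in para.iter("{%s}t" % W))
            headings.append((style_id, text))
    return headings


def make_sample_docx():
    """Build a tiny, hypothetical .docx in memory (demonstration only)."""
    body = (
        '<w:document xmlns:w="%s"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Intro</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Plain text</w:t></w:r></w:p>'
        '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
        '<w:r><w:t>Details</w:t></w:r></w:p>'
        '</w:body></w:document>' % W
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


sample_headings = extract_headings(make_sample_docx())
for style_id, text in sample_headings:
    print(style_id, text)
```

For a real file you would pass its path, e.g. extract_headings('test.docx'). Documents created in non-English versions of Word or with custom styles may use different style IDs, so treat the "Heading" prefix as an assumption to check.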
Upvotes: 4
Reputation: 37279
The Word object model can be found here. Your doc object will contain these properties, and you can use them to perform your desired actions (note that I haven't used this feature with Word, so my knowledge of the object model is sparse). For instance, if you wanted to read all the words in a document, you could do:
for word in doc.Words:
    print(word)
And you would get all of the words. Each of those word items would be a Range object representing a single word (reference here), so you could access its properties during iteration. In your case, here is how you would get the style:
for word in doc.Words:
    print(word.Style)
On a sample doc with a single Heading 1 and normal text, this prints:
Heading 1
Heading 1
Heading 1
Heading 1
Heading 1
Normal
Normal
Normal
Normal
Normal
To group the headings together, you can use itertools.groupby. As explained in the code comments below, you need to reference the str() of the object itself, as using word.Style returns an instance that won't properly group with other instances of the same style:
from itertools import groupby
import win32com.client as win32
# All the same as yours
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("testdoc.doc")
doc = word.ActiveDocument
# Here we use itertools.groupby (without sorting anything) to
# find groups of words that share the same heading (note it picks
# up newlines). The tricky/confusing thing here is that you can't
# just group on the Style itself - you have to group on the str().
# There was some other interesting behavior, but I have zero
# experience with COMObjects so I'll leave it there :)
# All of these comments for two lines of code :)
for heading, grp_wrds in groupby(doc.Words, key=lambda x: str(x.Style)):
    print(heading, ''.join(str(w) for w in grp_wrds))
This outputs:
Heading 1 Here is some text
Normal
No header
If you replace the join with a list comprehension, you get the output below (where you can see the newlines):
Heading 1 ['Here ', 'is ', 'some ', 'text', '\r']
Normal ['\r', 'No ', 'header', '\r', '\r']
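To see why the key has to be str(x.Style), here is a small sketch that runs without Word. `FakeStyle` and `FakeWord` are hypothetical stand-ins, not part of pywin32; they only assume the behavior described above, namely that each access to Style hands back a new instance that does not compare equal to other instances of the same style.

```python
from itertools import groupby


class FakeStyle:
    """Hypothetical stand-in for a style wrapper: no __eq__, so instances
    are compared by identity and never match each other."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


class FakeWord:
    """Hypothetical stand-in for one item of doc.Words."""

    def __init__(self, text, style_name):
        self.text = text
        self._style_name = style_name

    @property
    def Style(self):
        # A fresh wrapper object is returned on every access
        return FakeStyle(self._style_name)

    def __str__(self):
        return self.text


words = [
    FakeWord("Here ", "Heading 1"),
    FakeWord("is ", "Heading 1"),
    FakeWord("text\r", "Heading 1"),
    FakeWord("No ", "Normal"),
    FakeWord("header\r", "Normal"),
]

# Grouping on the object itself: every word ends up in its own group
by_object = [(str(key), [str(w) for w in grp])
             for key, grp in groupby(words, key=lambda w: w.Style)]

# Grouping on the string: consecutive words with the same style are merged
by_name = [(key, "".join(str(w) for w in grp))
           for key, grp in groupby(words, key=lambda w: str(w.Style))]

print(len(by_object), "groups when keyed on the object")
print(by_name)
```

Because groupby only merges consecutive items, a second Heading 1 later in the document would show up as its own group, which is what you want when listing headings in order.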
Upvotes: 4
Reputation: 41667
You can also use the Google Drive SDK to convert the Word document to something more useful, like HTML, where you can easily extract the headers.
https://developers.google.com/drive/manage-uploads
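Once you have the HTML, pulling out the headings needs nothing beyond the standard library. The sketch below does not show the conversion step; it assumes the exported HTML marks headings with h1 to h6 tags, and `HeadingCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (tag, text) pairs for h1..h6 elements."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current_tag = None  # heading tag we are currently inside, if any
        self._parts = []          # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current_tag = tag
            self._parts = []

    def handle_data(self, data):
        # Keep text only while inside a heading (including nested inline tags)
        if self._current_tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            self.headings.append((tag, "".join(self._parts).strip()))
            self._current_tag = None


# Hypothetical sample standing in for the converted document
sample_html = (
    "<html><body><h1>Intro</h1><p>Plain text</p>"
    "<h2>More <b>details</b></h2></body></html>"
)

collector = HeadingCollector()
collector.feed(sample_html)
collector.close()
html_headings = collector.headings
print(html_headings)
```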
Upvotes: 2