Reputation: 1379

Parse Word Document in Python

I wanted to convert a Word document to text, so I used this script:

import win32com.client 

app = win32com.client.Dispatch('Word.Application') 
doc = app.Documents.Open(r'C:\Users\SBYSMR10\Desktop\New folder (2)\GENERAL DATA.doc') 
content=doc.Content.Text
app.Quit()
print content

I get the following result:

[image: screenshot of the printed output]

Now I want to convert this text into a list that contains all of its items. I used:

content = " ".join(content.replace(u"\xa0", " ").strip().split())

EDIT

When I do that, I get:

[image: screenshot of the output after the join]

It's not a list. What is the problem? What is that big dot character?

Upvotes: 1

Answers (4)

Ryan

Reputation: 299

You could just parse the Word document line by line. It isn't elegant and it certainly isn't pretty, but it works. Here's a snippet from something similar I've done in Python 3.3.

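import os

directory = 'your/path/to/file/'
file = 'yourword.doc'
# Open read-only in binary mode: a .doc file is not plain text.
with open(os.path.join(directory, file), 'rb') as doc:
    for line in doc:
        line2 = str(line)  # string form of the raw bytes, e.g. "b'...'"
        print(line2)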

I used a regular expression to get just what I needed. This code reads each line of your Word document (formatting and all) and converts it to a string that you can work with. Not sure if this is helpful at all (this post is a couple of years old), but at least it parses the Word document. Then it's just a matter of getting rid of the strings you don't want before writing to a txt file.

Upvotes: 0

Abdurahman

Reputation: 658

Check the post at this link and its comments: Converting Word documents to text (Python recipe)

This post may also be useful: python convert microsoft office docs to plain text on linux

Upvotes: 0

Fabian

Reputation: 4348

Now I want to convert this text into a list that contains all of its items. I used:

content = " ".join(content.replace(u"\xa0", " ").strip().split())

It's not a list. What is the problem?

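The .join method always returns a string. It expects an iterable of strings (such as the list that .split() returns) and concatenates its items with the given delimiter (" " in your case). So your line splits the text into a list and immediately glues it back into a single string.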
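A minimal sketch of the difference, written for Python 3. The sample string is made up, since the real document text isn't shown. It assumes that Word's Content.Text marks paragraph ends with "\r" and table cell ends with "\x07", which is how Word usually behaves over COM but is not confirmed by the question. If that holds, the "big dot" may simply be how the console draws "\x07".

```python
import re

# Hypothetical sample standing in for doc.Content.Text (not the real document).
# Assumption: "\r" ends a paragraph and "\x07" ends a table cell.
content = "GENERAL DATA\rName\x07Value\x07\r\x07Length\xa0overall\x0712 m\x07\r\x07"

# " ".join(...) glues the pieces back together, so the result is one string.
joined = " ".join(content.replace("\xa0", " ").split())
print(type(joined).__name__)

# Splitting on the assumed separators and dropping empty pieces gives a list.
cleaned = content.replace("\xa0", " ")
items = [piece.strip() for piece in re.split(r"[\r\x07]+", cleaned)]
items = [piece for piece in items if piece]
print(items)
```

If your document uses other separators, adjust the character class in re.split accordingly.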

Apart from that, what Aaron Digulla said.

Upvotes: 0

Aaron Digulla

Reputation: 328594

Word documents aren't text, they are documents: They have control information (like formatting) and text. If you ignore the control information, the text is pretty useless.

So you have to dig into the details of how to navigate the control structure of the document, find the parts that you're interested in, and then get the text content of those structures.

Note: You'll find that Word is very complex. If you can, consider these two approaches as well:

Save the Word document as HTML from within Word. It'll lose some formatting but lists will stay intact. HTML is much simpler to parse and understand than Word.
Save the document as OOXML (available since Office 2007; the extension is .docx). This is a ZIP archive with XML documents inside. The XML is again easier to parse and understand than the full Word document, but harder than the HTML version.
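To illustrate the second approach, here is a rough sketch that uses only the standard library to pull paragraph text out of a .docx file. It assumes the document has already been saved as .docx; the function name docx_paragraphs and the file name in the usage line are made up for this example. It ignores tables, lists and all other structure, and a paragraph nested inside another one (for example in a text box) would show up twice, so treat it as a starting point rather than a complete parser.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main WordprocessingML part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each non-empty paragraph (hypothetical helper)."""
    # A .docx file is a ZIP archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Usage would look like items = docx_paragraphs("GENERAL DATA.docx"), which gives you a real Python list with one entry per paragraph.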

Upvotes: 9
