Reputation: 1379
I wanted to convert a Word document to text, so I used this script:
import win32com.client
app = win32com.client.Dispatch('Word.Application')
doc = app.Documents.Open(r'C:\Users\SBYSMR10\Desktop\New folder (2)\GENERAL DATA.doc')
content=doc.Content.Text
app.Quit()
print content
I get the following result:
Now I want to convert this text into a list that contains all of its items. I used:
content = " ".join(content.replace(u"\xa0", " ").strip().split())
EDIT
When I do that, I get:
It's not a list. What is the problem? And what is that big dot character?
Upvotes: 1
Views: 13374
Reputation: 299
You could just parse the Word document line by line. It isn't elegant and it certainly isn't pretty, but it works. Here's a snippet from something similar I've done in Python 3.3.
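import os
directory = 'your/path/to/file/'
file = 'yourword.doc'
# Open the raw .doc file in binary mode and read it line by line
with open(os.path.join(directory, file), 'rb') as doc:
    for line in doc:
        line2 = str(line)  # string form of the raw bytes, e.g. "b'...'"
        print(line2)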
I used a regular expression to get just what I needed. This code reads each line of your Word document (formatting and all) and converts it to strings that you can work with. Not sure if this is helpful at all (this post is a couple of years old), but at least it parses the Word document. Then it's just a matter of getting rid of the strings you don't want before writing to a txt file.
Upvotes: 0
Reputation: 658
Check this post and its comments: Converting Word documents to text (Python recipe)
This post may also be useful: python convert microsoft office docs to plain text on linux
Upvotes: 0
Reputation: 4348
Now i want to convert this text into a list which contains all its items. I used
content = " ".join(content.replace(u"\xa0", " ").strip().split())
Its not a list. What is the problem?
The .join method always returns a string. It expects you to pass a list (or any other iterable of strings) and concatenates the items with the given delimiter (" " in your case). The list you want is what .split() returns, before the join.
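A minimal sketch of the difference, written for Python 3. The sample string is made up and only stands in for the real doc.Content.Text; it assumes Word separates paragraphs with "\r", which is how its Range.Text usually comes back.

```python
# Hypothetical sample standing in for doc.Content.Text (an assumption,
# not the asker's real document).
content = "GENERAL\xa0DATA\rName:  John\rAge: 30\r"

# Replace non-breaking spaces and trim the ends.
cleaned = content.replace("\xa0", " ").strip()

# str.split() with no argument returns a list of whitespace-separated words.
words = cleaned.split()

# " ".join(...) glues that list back into ONE string, which is why the
# line in the question does not produce a list.
joined = " ".join(words)

# Splitting on "\r" instead gives one list item per paragraph.
paragraphs = [p.strip() for p in cleaned.split("\r") if p.strip()]

print(type(words).__name__, type(joined).__name__)
print(paragraphs)
```

So dropping the " ".join(...) wrapper, or splitting on the paragraph separator, is enough to get a list.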
Apart from that, what Aaron Digulla said.
Upvotes: 0
Reputation: 328594
Word documents aren't text, they are documents: They have control information (like formatting) and text. If you ignore the control information, the text is pretty useless.
So you have to dig into the details of how to navigate the control structure of the document, find the parts that you're interested in, and then get the text content of those structures.
Note: You'll find that Word is very complex. If you can, consider these two approaches as well:
Save the Word document as HTML from within Word. It'll lose some formatting but lists will stay intact. HTML is much simpler to parse and understand than Word.
Save the document as OOXML (available since Office 2007; the extension is .docx). This is a ZIP archive with XML documents inside. The XML is again easier to parse and understand than the full Word document, but harder than the HTML version.
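For the .docx route, a rough sketch using only the Python 3 standard library might look like this. It assumes the main body lives in word/document.xml inside the archive (the standard location) and the function name docx_paragraphs is just illustrative. It only collects plain text runs, so tabs, line breaks, and list numbering are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx file.

    `source` is a file path or a file-like object. The function name is
    hypothetical, not part of any library.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs with the path to a saved .docx copy of the document would then return one list item per paragraph, which is close to the list the question asks for.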
Upvotes: 9