shahid hamdam

Reputation: 821

How to read a .doc file with Python (not .docx)

I am trying to read a .doc file in Python. I don't want to use textract because of its OS-level dependencies, and I don't want to use docx2txt because, as far as I understand, it reads only .docx files, not .doc.

Are there any similar modules, or can this even be achieved without library support?

Upvotes: 2

Views: 6095

Answers (1)

The Pilot Dude

Reputation: 2237

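One way is to use the win32com module, which is part of the pywin32 package and can be installed with pip install pywin32. It drives an installed copy of Microsoft Word through COM, so it only works on Windows with Word installed. It can open the .doc document and return its text. Try this: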

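import win32com.client

# Start Word through COM and keep its window hidden.
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    doc = word.Documents.Open(r"C:\Users\main\OneDrive\Documents\User\Paper.doc")
    print(doc.Range().Text)
    doc.Close(False)  # close the document without saving changes
finally:
    word.Quit()  # always shut down the background Word process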

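Another way is to use BeautifulSoup, but this method can be unreliable: it parses the file as markup, so it is only likely to return clean text when the .doc file is really HTML saved with a .doc extension.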

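from bs4 import BeautifulSoup
# ISO-8859-1 maps every byte to a character, so decoding never fails.
with open(r"C:\Users\main\OneDrive\Documents\User\Paper.doc", encoding="ISO-8859-1") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
# Drop <style> and <script> elements so their contents are not returned as text.
for tag in soup(["style", "script"]):
    tag.extract()
tmpText = soup.get_text()
# Remove tabs and newlines, then trim surrounding whitespace.
text = tmpText.replace("\t", "").replace("\n", "").strip()
print(text)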
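
Before choosing between the two approaches, it may help to check what kind of file you actually have. Here is a minimal sketch using only the standard library, assuming that a genuine binary .doc is an OLE2 compound file and therefore begins with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1. The name looks_like_binary_doc is a hypothetical helper for illustration, not part of any library.

```python
# Signature found at the start of every OLE2 compound file,
# the container format used by binary Word .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_binary_doc(path):
    """Return True if the file starts with the OLE2 signature.

    Hypothetical helper: it only inspects the first 8 bytes, so it cannot
    tell a Word document apart from other OLE2 files such as old .xls files.
    """
    with open(path, "rb") as f:
        return f.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```

If this returns False for your file, it may be HTML or RTF saved with a .doc extension, in which case a text-based approach like the BeautifulSoup one has a better chance of working. If it returns True, the file is binary and needs something that understands the Word format.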

Upvotes: 4
