Reputation: 821
I am trying to read a .doc file in Python, and I don't want to use textract because of its OS dependencies. I also don't want to use docx2txt because, as far as I understand, it reads only .docx files, not .doc files.
Are there any similar modules or can this even be achieved without library support?
Upvotes: 2
Views: 6095
Reputation: 2237
One way is to use the win32com module, which is part of the pywin32 package and can be installed with pip install pywin32. It drives Microsoft Word through COM, so it only works on Windows with Word installed. It can open the .doc document and return its text. Try this:
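import win32com.client

# Start Word in the background (requires Windows with Word installed)
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    doc = word.Documents.Open(r"C:\Users\main\OneDrive\Documents\User\Paper.doc")
    print(doc.Range().Text)
    doc.Close(False)  # close without saving changes
finally:
    word.Quit()  # do not leave a hidden Word process running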
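The text that Word returns usually marks paragraphs with carriage returns instead of newlines, and table cells end with a bell character, so the printed output can look odd. Below is a minimal cleanup sketch in plain Python. The name clean_word_text is a hypothetical helper of mine, not part of pywin32, and the set of control characters it handles is an assumption that may not cover every document.

```python
def clean_word_text(raw: str) -> str:
    """Normalize text as returned by Word's Range.Text (hypothetical helper)."""
    # Assumption: \r marks paragraph ends, \x0b marks manual line breaks,
    # and \x07 marks the end of a table cell.
    text = raw.replace("\x07", "").replace("\x0b", "\n").replace("\r", "\n")
    # Trim each line and drop the empty ones.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

With the snippet above, you would call it as print(clean_word_text(doc.Range().Text)).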
Another way is to read the file as text and parse it with BeautifulSoup. This can be buggy, because a true .doc file is binary rather than HTML, so the output may contain stray characters:
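from bs4 import BeautifulSoup

path = r"C:\Users\main\OneDrive\Documents\User\Paper.doc"
with open(path, encoding="ISO-8859-1") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
# Drop style and script elements before extracting the text
for tag in soup(["style", "script"]):
    tag.extract()
# Remove tabs and newlines, then trim surrounding whitespace
text = soup.get_text().replace("\t", "").replace("\n", "").strip()
print(text)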
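As for doing it without library support: a rough approach that uses only the standard library is to scan the raw bytes for runs of readable characters, similar to the Unix strings tool. This is a heuristic sketch, not a real .doc parser. It assumes the body text is stored either as 8-bit characters or as UTF-16LE, it only recognizes printable ASCII, and it will also pick up non-body strings such as font names and metadata. The function name extract_text_runs and the min_len parameter are my own, for illustration only.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Heuristically pull readable text runs out of raw .doc bytes."""
    runs = []
    # 8-bit text: runs of at least min_len printable ASCII bytes.
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    runs += [m.decode("ascii") for m in ascii_pat.findall(data)]
    # UTF-16LE text: printable ASCII bytes, each followed by a NUL byte.
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    runs += [m.decode("utf-16-le") for m in utf16_pat.findall(data)]
    return runs
```

To try it, open the file in binary mode, for example with open(path, "rb") as f, and print "\n".join(extract_text_runs(f.read())). The 8-bit runs come first and the UTF-16 runs second, so the result does not necessarily follow document order.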
Upvotes: 4