Italo Lemos

Reputation: 1022

Read .doc file with python

As part of a test for a job application, I need to read some .doc files. Does anyone know a library that can do this? I started with plain Python:

f = open('test.doc', 'r')
f.read()

but this does not return a readable string. I need to convert it to UTF-8.

Edit: I just want to get the text from this file.

Upvotes: 51

Views: 256931

Answers (11)

admin

Reputation: 169

pip install edoc

>>> import edoc
>>> edoc.extraxt_txt(file_path)
'It was a dark and stormy night.'

Upvotes: 0

10SecTom

Reputation: 2664

I was trying to do the same, and I found lots of information on reading .docx but much less on .doc. Anyway, I managed to read the text using the following:

import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = False
wb = word.Documents.Open("myfile.doc")
doc = word.ActiveDocument
print(doc.Range().Text)

Edit:

To close everything completely, it is better to append this:

# close the document
doc.Close(False)

# quit Word
word.Quit()

Also, note that you should use the absolute path to your .doc file, not a relative one. Use this to get the absolute path:

import os

# for example, ``rel_path`` could be './myfile.doc'
full_path = os.path.abspath(rel_path)
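
The pieces above (open, read, close, quit, absolute path) can be combined into one function. This is only a sketch: the name read_doc_via_word and the optional word parameter are made up for illustration, and it assumes Windows with Microsoft Word and pywin32 installed. The try/finally makes sure the document is closed even if reading fails.

```python
import os


def read_doc_via_word(path, word=None):
    """Return the text of a .doc file using Word's COM automation.

    `word` is a hypothetical hook: pass an existing Word.Application
    object to reuse it; otherwise one is started and quit here.
    """
    own_word = word is None
    if own_word:
        # Imported lazily so this module still imports on non-Windows machines.
        import win32com.client
        word = win32com.client.Dispatch("Word.Application")
        word.Visible = False
    doc = None
    try:
        # Word resolves relative paths against its own working directory,
        # so always hand it an absolute path.
        doc = word.Documents.Open(os.path.abspath(path))
        return doc.Range().Text
    finally:
        if doc is not None:
            doc.Close(False)  # close without saving changes
        if own_word:
            word.Quit()
```

Calling read_doc_via_word("myfile.doc") would then start Word, return the text, and shut Word down again.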

Upvotes: 28

Nishant Verma

Reputation: 1

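If you are looking for how to read a .doc file in Python, this code will work. Install all the related packages first.

import os
from io import BytesIO

import pythoncom
import requests
import win32com.client as win32

# `doc_file` and `request` are assumed to come from the surrounding web application
if doc_file:
    # download the document and keep its bytes in memory
    _file = requests.get(request.values['MediaUrl0'])
    doc_file_link = BytesIO(_file.content)

    # save it as a temporary .doc file in the current directory
    file_path = os.path.join(os.getcwd(), 'data.doc')
    with open(file_path, 'wb') as E:
        E.write(doc_file_link.getbuffer())

    # initialise COM for this thread, then let Word open the file and read its text
    pythoncom.CoInitialize()
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(file_path)
    doc.Activate()
    doc_data = doc.Range().Text
    print(doc_data)
    doc.Close(False)

    # remove the temporary file
    if os.path.exists(file_path):
        os.remove(file_path)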

Upvotes: 0

Venkata Ramana

Reputation: 25

!pip install python-docx

import docx

#Creating a word file object
doc = open("file.docx","rb")

#creating word reader object
document = docx.Document(doc)

Upvotes: -1

zzhapar

Reputation: 135

I looked for a solution for a long time. There is not much material about .doc files; in the end I solved the problem by converting the .doc file to .docx:

import os
from win32com import client as wc

w = wc.Dispatch('Word.Application')
# Or use the following method to start a separate process:
# w = wc.DispatchEx('Word.Application')
doc = w.Documents.Open(os.path.abspath('test.doc'))
# 16 is wdFormatDocumentDefault, Word's default document format (.docx)
doc.SaveAs(os.path.abspath("test_docx.docx"), 16)
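
Once the file has been converted, the .docx can be read without Word, because a .docx is a zip archive whose main text lives in word/document.xml. The following is a minimal sketch using only the standard library; the function name docx_paragraphs is made up here, and it only looks at the main body (no headers, footers or footnotes).

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the text of each paragraph in the main body of a .docx.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of every <w:t> in it.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

For example, "\n".join(docx_paragraphs("test_docx.docx")) would give the body text as one string.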

Upvotes: 3

Viktor

Reputation: 32

I had to do the same to search through a ton of *.doc files for a specific number and came up with:

special_chars = {
    "b'\\t'": '\t',
    "b'\\r'": '\n',
    "b'\\x07'": '|',
    "b'\\xc4'": 'Ä',
    "b'\\xe4'": 'ä',
    "b'\\xdc'": 'Ü',
    "b'\\xfc'": 'ü',
    "b'\\xd6'": 'Ö',
    "b'\\xf6'": 'ö',
    "b'\\xdf'": 'ß',
    "b'\\xa7'": '§',
    "b'\\xb0'": '°',
    "b'\\x82'": '‚',
    "b'\\x84'": '„',
    "b'\\x91'": '‘',
    "b'\\x93'": '“',
    "b'\\x96'": '-',
    "b'\\xb4'": '´'
}


def get_string(path):
    string = ''
    with open(path, 'rb') as stream:
        stream.seek(2560) # Offset - text starts after byte 2560
        current_stream = stream.read(1)
        while not (str(current_stream) == "b'\\xfa'"):
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum():
                        string += char
                except UnicodeDecodeError:
                    string += ''
            current_stream = stream.read(1)
    return string

I'm not sure how 'clean' this solution is, but it works well with regex.
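
Most of the entries in the table above match the Windows-1252 (cp1252) code page, so Python's built-in codec can do that part of the mapping. This is a sketch under the assumption that the text in the file really is stored as 8-bit cp1252, which the table suggests; Word can also store text as UTF-16, in which case this does not apply. The helper name decode_cp1252 is made up.

```python
def decode_cp1252(raw):
    """Decode bytes as Windows-1252.

    Bytes that are undefined in cp1252 become U+FFFD instead of
    raising UnicodeDecodeError.
    """
    return raw.decode("cp1252", errors="replace")


# Bytes taken from the table above
print(decode_cp1252(b"\xc4\xe4\xdc\xfc\xd6\xf6\xdf\xa7\xb0"))
```

Note that cp1252 decodes byte 0x96 to an en dash rather than the plain hyphen used in the table, and that control bytes such as 0x07 (Word's table cell marker) still need the manual replacement.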

Upvotes: 0

lucas F

Reputation: 381

The answer from Shivam Kotwalia works perfectly. However, the result is returned as bytes. Sometimes you need it as a string, for example to run a regex over it.

I recommend the following code (two lines from Shivam Kotwalia's answer) :

import textract

text = textract.process("path/to/file.extension")
text = text.decode("utf-8") 

The last line will convert the object text to a string.
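
As a small illustration of why the string form helps, here is a sketch that decodes the bytes and then runs a regex over the result. The function name find_numbers is made up, and the bytes literal merely stands in for whatever textract.process returns; errors="replace" is my own choice so that a stray invalid byte does not abort the whole extraction.

```python
import re


def find_numbers(raw, encoding="utf-8"):
    """Decode extracted bytes and return every run of digits in the text."""
    text = raw.decode(encoding, errors="replace")
    return re.findall(r"\d+", text)


# Stand-in for the bytes returned by textract.process(...)
sample = b"Invoice 42, item 7"
print(find_numbers(sample))
```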

Upvotes: 15

Rahul Nimbal

Reputation: 585

I agree with Shivam's answer, except that textract is not available for Windows. Also, antiword sometimes fails to read '.doc' files and gives an error:

'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office, e.g. web pages may be stored offline with a .doc extension.

So, I've got the following workaround to extract the text:

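from bs4 import BeautifulSoup as bs

# the file is really HTML saved with a .doc extension, so parse it as HTML
with open(filename) as f:
    soup = bs(f.read(), 'html.parser')
# drop <style> and <script> elements so only the visible text remains
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# remove tabs and newlines
text = "".join("".join(tmpText.split('\t')).split('\n')).strip()
print(text)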

This script will work with most kinds of files. Have fun!
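
To decide up front whether a '.doc' file is a real Word document or something else with that extension, one option is to look at its first bytes. The sketch below relies on well-known signatures: legacy Word files are OLE2 compound files, .docx files are zip archives, and RTF files start with {\rtf. The function name and the returned labels are made up, and a match only identifies the container, not that the content is definitely Word.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"                       # .docx and every other zip file


def sniff_word_format(head):
    """Guess the container type from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    stripped = head.lstrip().lower()
    if stripped.startswith((b"<!doctype html", b"<html", b"<?xml")):
        return "markup"
    return "unknown"
```

A typical use would be sniff_word_format(open(filename, "rb").read(64)): "ole2" suggests antiword or Word should handle the file, while "markup" suggests the BeautifulSoup route above.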

Upvotes: 7

Aslam Shaik

Reputation: 1971

Prerequisites :

install antiword : sudo apt-get install antiword

install docx : pip install docx

from subprocess import Popen, PIPE

# not used by the .doc path below
from docx import opendocx, getdocumenttext


def document_to_text(filename, file_path):
    # run antiword on the file and capture the plain text it prints
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    return stdout.decode('ascii', 'ignore')

print(document_to_text('your_file_name', 'your_file_path'))

Notice – New versions of python-docx removed this function. Make sure to pip install docx and not the new python-docx

Upvotes: 5

Billal BEGUERADJ

Reputation: 22804

You can use python-docx2txt library to read text from Microsoft Word documents. It is an improvement over python-docx library as it can, in addition, extract text from links, headers and footers. It can even extract images.

You can install it by running: pip install docx2txt.

Let's download and read the first Microsoft document on here:

import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)

EDIT:

This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.

Upvotes: 36

Shivam Kotwalia

Reputation: 1503

One can use the textract library. It takes care of both "doc" and "docx" files.

import textract
text = textract.process("path/to/file.extension")

You can also use antiword (sudo apt-get install antiword) on its own. Note that antiword outputs plain text, not .docx, so redirect the result to a text file and read that directly (there is no need for docx2txt):

antiword filename.doc > filename.txt

Under the hood, textract itself uses antiword for .doc files.

Upvotes: 60
