helloworld

Reputation: 89

Read Docx files via python

Does anyone know of a Python library for reading .docx files?

I have a word document that I am trying to read data from.

Upvotes: 6

Views: 30084

Answers (5)

Todd Vanyo

Reputation: 593

python-docx can read as well as write.

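import docx  # installed as the python-docx package

doc = docx.Document('myfile.docx')
allText = []
# Collect the text of each paragraph in document order.
for docpara in doc.paragraphs:
    allText.append(docpara.text)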

Now all paragraphs will be in the list allText.

Thanks to Automate the Boring Stuff with Python by Al Sweigart for the pointer.
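
Note that doc.paragraphs only covers body paragraphs; text inside tables is reached through doc.tables instead. Here is a minimal sketch that collects both, assuming python-docx is installed. The helper names collect_text and read_docx are made up for illustration and are not part of the library.

```python
def collect_text(doc):
    """Return body paragraph text followed by table cell text.

    `doc` is expected to be a python-docx Document (anything exposing
    .paragraphs and .tables works). Table text is appended after the
    paragraphs, so the result is not in strict document order.
    """
    lines = [para.text for para in doc.paragraphs]
    # doc.paragraphs skips tables, so walk them separately.
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                lines.append(cell.text)
    return lines


def read_docx(path):
    # Hypothetical convenience wrapper; the import is local so that
    # collect_text can be exercised without python-docx installed.
    import docx  # provided by the python-docx package
    return collect_text(docx.Document(path))
```

Calling read_docx('myfile.docx') would then give one list holding the paragraph strings and the cell strings.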

Upvotes: 6

There are a few options for this:

  1. python-docx.

  2. docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx. From original documentation:

import docx2txt

# extract text
text = docx2txt.process("file.docx")

# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")

  3. textract (which works via docx2txt).

  4. Since .docx files are simply .zip files with a changed extension, this shows how to access the contents. This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files. In that case, you would likely have to convert .doc to .docx first. antiword is an option.
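
To illustrate the last point, here is a rough sketch that reads paragraph text with only the standard library, by opening the archive and parsing word/document.xml. The function name docx_paragraphs is made up for this example, and the sketch ignores tabs, line breaks, headers, and footnotes, so treat it as a starting point rather than a replacement for the packages above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each <w:p> paragraph in a .docx file."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Text lives in <w:t> elements nested inside each paragraph's runs.
        pieces = [t.text or "" for t in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

For example, docx_paragraphs("file.docx") returns a list of strings, one per paragraph, including paragraphs that sit inside table cells.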

Upvotes: 9

Sri

Reputation: 2328

See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/

You should use the python-docx library, available on PyPI. Then you can use the following:

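import docx  # installed as the python-docx package

doc = docx.Document('myfile.docx')
allText = []
# Collect the text of each paragraph in document order.
for docpara in doc.paragraphs:
    allText.append(docpara.text)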

Upvotes: 2

zaid.mohammed

Reputation: 9

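import docx
from docx.opc.exceptions import PackageNotFoundError

def main():
    try:
        doc = docx.Document('test.docx')  # Open the Word document.
    except (IOError, PackageNotFoundError):
        # python-docx raises PackageNotFoundError when the path is not a valid .docx.
        print('There was an error opening the file!')
        return

    # Collect the text of every paragraph, then join once after the loop.
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    data = '\n'.join(fullText)

    print(data)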


if __name__ == '__main__':
    main()

And don't forget to install python-docx first (pip install python-docx).

Upvotes: 0

William Jackson

Reputation: 1165

A quick search of PyPI turns up the docx package.

Upvotes: 1
