Reputation: 89
Does anyone know a Python library for reading .docx files?
I have a word document that I am trying to read data from.
Upvotes: 6
Views: 30084
Reputation: 593
python-docx can read as well as write.
import docx

doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to Automate the Boring Stuff with Python by Al Sweigart for the pointer.
Upvotes: 6
Reputation: 15641
There are a couple of packages that let you do this. Check docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx.
From the original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
Since .docx files are simply .zip files with a changed extension, this shows how to access the contents.
This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files.
In that case, you would likely have to convert .doc -> .docx first. antiword is an option.
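To illustrate the "it's just a zip" point, here is a minimal sketch that reads paragraph text using only the standard library. It assumes the main body lives in word/document.xml inside the archive and that text sits in w:t elements under w:p paragraphs, which is the usual WordprocessingML layout. The function name read_docx_paragraphs is made up for this example, and it ignores headers, footnotes, and anything else stored in other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file (hypothetical helper)."""
    # A legacy .doc file is not a zip archive, so it is rejected here.
    if not zipfile.is_zipfile(path):
        raise ValueError("not a zip-based .docx file (maybe a legacy .doc?)")

    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Calling read_docx_paragraphs('myfile.docx') gives a list of strings, roughly what the python-docx loop above produces, though python-docx and docx2txt handle many more cases and are the better choice for real work.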
Upvotes: 9
Reputation: 2328
See this library, which allows reading docx files: https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library, available on PyPI. Then you can use the following:
import docx

doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Upvotes: 2
Reputation: 9
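import docx
from docx.opc.exceptions import PackageNotFoundError

def main():
    try:
        doc = docx.Document('test.docx')  # Create the Word document reader object.
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except (IOError, PackageNotFoundError):
        # python-docx raises PackageNotFoundError when the path is missing or is not a .docx
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()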
And don't forget to install python-docx first, using pip install python-docx.
Upvotes: 0