Reputation: 397
I'm trying to read a docx file with the following code:
from docx import Document
doc = Document('test.docx')
But when I try to print it, I get this:
<docx.api.Document object at 0x02952C70>
How can I read the content inside the file?
I read that docx changed recently, so the old questions/answers don't apply anymore.
Upvotes: 1
Views: 4231
Reputation: 908
Check out the structure of the Document object here:
For example, if you want to get the property "paragraphs":
doc = Document('test.docx')
paragraphs = doc.paragraphs  # a property, not a method, so no parentheses
I hope this will help.
EDIT: I have found this snippet in python-docx's GitHub repository and edited it a little here:
import docx

# filename is the path to the .docx file, e.g. 'test.docx'
document = docx.Document(filename)
# In Python 3, paragraph.text is already a str, so no encode() call is needed
doc_text = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(doc_text)
The join() call receives the text of each paragraph in the list returned by the paragraphs property. So the result would look like:
paragraph 1
paragraph 2
paragraph 3
It looks like this works, but it doesn't print tables, headers or footers.
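To cover the tables, headers and footers that the snippet above skips, something along these lines might work. It is only a sketch: collect_text and read_docx are names made up for this example, and the header/footer part assumes python-docx 0.8.8 or later, which is newer than the 0.7.4 documentation linked in this answer. Table text is appended after the body paragraphs, so the original document order is not preserved, and merged cells may show up more than once.

```python
def collect_text(document):
    """Gather text from body paragraphs, tables, headers and footers.

    `document` is expected to be a python-docx Document object.
    """
    parts = []

    # Body paragraphs, in document order.
    parts.extend(p.text for p in document.paragraphs)

    # Tables: one line per row, cells separated by tabs.
    for table in document.tables:
        for row in table.rows:
            parts.append('\t'.join(cell.text for cell in row.cells))

    # Headers and footers hang off each section (assumes python-docx >= 0.8.8).
    for section in document.sections:
        parts.extend(p.text for p in section.header.paragraphs)
        parts.extend(p.text for p in section.footer.paragraphs)

    return '\n'.join(parts)


def read_docx(path):
    # Imported lazily so collect_text() can be tried without python-docx installed.
    from docx import Document
    return collect_text(Document(path))
```

Calling print(read_docx('test.docx')) would then print everything the function managed to collect.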
EDIT: This link is the main index for all documentation about python-docx:
python-docx 0.7.4 documentation
Upvotes: 4
Reputation: 413
It is possible to extract information from Word files with Python without using the docx module. One solution (there are many), from etienne, is a very basic version of docx, which may remove the hexadecimal numbers that you are getting. However, like SebasSBM's answer, it won't work for other features such as tables.
If that still doesn't work, I would suggest looking at these answers; maybe one of them will still be relevant to your new docx format.
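To illustrate the general idea of reading a Word file without the docx module (this is my own minimal sketch, not necessarily the solution mentioned above): a .docx file is a zip archive, and the main text lives in word/document.xml. The function name docx_to_text is made up for this example. It returns one line per paragraph, including paragraphs inside table cells, but it does not read headers or footers, which are stored in separate parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_to_text(path):
    """Return the text of every paragraph in a .docx file, one per line."""
    # A .docx file is a zip archive; the body is stored in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read('word/document.xml')

    root = ET.fromstring(xml_bytes)
    lines = []
    # Each <w:p> is a paragraph; its visible text is split across <w:t> runs.
    for para in root.iter(W_NS + 'p'):
        texts = [node.text or '' for node in para.iter(W_NS + 't')]
        lines.append(''.join(texts))
    return '\n'.join(lines)
```

With the file from the question, print(docx_to_text('test.docx')) should print the paragraph text rather than the object's repr.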
Upvotes: 0