user3511563

Reputation: 397

Reading docx with Python 2.7

I'm trying to read a docx file with the following code:

from docx import Document
doc = Document('test.docx')

But when I try to print it, I get this:

<docx.api.Document object at 0x02952C70>

How can I read the content inside the file?

I read that docx changed recently, so the old questions/answers don't apply anymore.

Upvotes: 1

Views: 4231

Answers (2)

SebasSBM

Reputation: 908

Check out the structure of the Document object here:

Source code for docx.api

For example, if you want to get the property "paragraphs":

from docx import Document
doc = Document('test.docx')
paragraphs = doc.paragraphs  # a property holding a list of Paragraph objects, not a method

I hope this will help.

EDIT: I found this snippet in python-docx's GitHub repository and edited it a little here:

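import docx

filename = 'test.docx'  # the file from the question
document = docx.Document(filename)
# Python 2: encode each paragraph's unicode text to UTF-8 before joining
docText = '\n\n'.join([
    paragraph.text.encode('utf-8') for paragraph in document.paragraphs
])
print docText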

The join() function receives a list of UTF-8 encoded strings, one for each paragraph in the list returned by the paragraphs property, and puts a blank line between them. So the result would look like:

paragraph 1

paragraph 2

paragraph 3

It looks like this works, but it doesn't print tables, headers or footers.
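For the tables part, here is a rough sketch rather than anything official. It assumes the Document object exposes tables, each table exposes rows, each row exposes cells, and each cell has a text attribute, as python-docx documents. The function name table_rows_as_text is made up for this example, and headers and footers are still not covered.

```python
def table_rows_as_text(document):
    """Collect the text of every table cell, grouped by table and by row.

    `document` is assumed to be the object returned by docx.Document(...).
    Returns a list of tables; each table is a list of rows; each row is a
    list of cell strings.
    """
    tables = []
    for table in document.tables:
        rows = []
        for row in table.rows:
            # cell.text joins the paragraphs inside the cell
            rows.append([cell.text for cell in row.cells])
        tables.append(rows)
    return tables

# Possible usage (requires python-docx and a real file):
#   from docx import Document
#   for table in table_rows_as_text(Document('test.docx')):
#       for row in table:
#           print(' | '.join(row))
```

Returning nested lists instead of printing keeps the function usable whether you want to print the cells, write them to CSV, or search them.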

EDIT: This link is the main index for all documentation about python-docx:

python-docx 0.7.4 documentation

Upvotes: 4

TheDarkTurtle

Reputation: 413

You can also extract information from Word files in Python without the docx module. One solution (there are many), from etienne, is a very basic version of docx, which may avoid the hexadecimal object address you are getting. However, like SebasSBM's answer, it won't work for other features such as tables.
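To show the idea of going without the docx module, here is a minimal sketch using only the standard library. This is not etienne's code. It relies on the fact that a .docx file is a zip archive whose main body lives in word/document.xml, with paragraphs stored as w:p elements and their text in w:t elements. The function name docx_paragraphs is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in the {uri}tag form ElementTree expects
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_paragraphs(path):
    """Return the text of each paragraph found in word/document.xml.

    `path` may be a filename or a file-like object opened in binary mode.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))

    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # A paragraph's text is split across runs; join the w:t pieces
        pieces = [node.text for node in para.iter(W_NS + 't') if node.text]
        paragraphs.append(''.join(pieces))
    return paragraphs

# Possible usage:
#   print('\n\n'.join(docx_paragraphs('test.docx')))
```

Because it walks every w:p element, paragraphs inside table cells come out too, but flattened into the same list with no row or column structure. Headers and footers are stored in separate parts of the archive and are not read here.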

If that still doesn't work, I would suggest looking at these answers; maybe one of them will still be relevant to your new docx format.

Upvotes: 0
