ignoramus

Reputation: 23

Python - grab all text in .docx and dump into .txt

I am wondering how I would write a Python script to carry out the following set of steps: (1) open a typical .docx, (2) select all, (3) copy to clipboard, (4) store as a string.

I don't care about preserving any formatting, nor about graphics, nor about tables. I just want the text stored as a gigantic string, for parsing and analysis.

Upvotes: 1

Views: 3858

Answers (2)

jstuartmilne

Reputation: 4488

Since you are working with a .docx file, you could consider using python-docx: https://python-docx.readthedocs.io/en/latest/

According to the documentation, you could write something like this:

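import docx

def getText(filename):
    # Open the .docx and collect the text of each body paragraph.
    # Note: doc.paragraphs does not include text inside tables.
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    # Join the paragraphs into one string, one paragraph per line.
    return '\n'.join(fullText)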

That gets all the text as one string. If you also want it on the clipboard, you could use something like pyperclip. I haven't tried it, but I would imagine something like:

import docx
import pyperclip

textInFile = getText("yourDoc.docx")
pyperclip.copy(textInFile)

https://github.com/asweigart/pyperclip
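
Since the title asks for a .txt dump rather than the clipboard, here is a rough sketch of that step using only the standard library instead of python-docx. It relies on the fact that a .docx is a zip archive whose main body lives in word/document.xml. The function names docx_to_text and dump_docx_to_txt are made up for this example, and unlike doc.paragraphs this walks every paragraph element, so table text is included and text-box content may show up twice. Treat it as a starting point, not a complete extractor.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the text of a .docx as one string, one line per paragraph."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every w:t node.
        paragraphs.append(
            "".join(node.text or "" for node in para.iter(W_NS + "t"))
        )
    return "\n".join(paragraphs)


def dump_docx_to_txt(docx_path, txt_path):
    """Write the extracted text to a UTF-8 .txt file and also return it."""
    text = docx_to_text(docx_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

Usage would be something like dump_docx_to_txt("yourDoc.docx", "yourDoc.txt"); the returned string is the same text, ready for parsing.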

Upvotes: 1

JavierCastro

Reputation: 328

There are libraries to help with this. Take a look at python-docx, which despite being oriented towards creating and updating docx files will allow you to read the contents of a document.

This answer HERE might help you start, but is by no means complete.

Here's a link to the python-docx documentation.

Upvotes: 0
