Reputation: 1
I have been trying to convert a number of DOCX files into TXT.
It works for a single file using the code below:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

if __name__ == '__main__':
    filename = '/content/drive/My Drive/path/file.DOCX'  # file name
    fullText = getText(filename)
    print(fullText)
    file = open("copy.txt", "w")
    file.write(fullText)
    file.close()
I tried different options (e.g. glob) but did not manage to get it to do the above operation on all files in a folder.
Ideally the output should be 1 large text file and not separate ones. I will need to do some formatting and assigning of IDs in that file in a next step.
Thank you for your help! corp-alt
Upvotes: 0
Views: 1812
Reputation: 2343
With file = open("copy.txt", "w") you open the file and empty it, so whatever you write() replaces the previous content.
With file = open("copy.txt", "a") you append to the existing file with write(), and the file is created if it doesn't exist yet.
With file = open("copy.txt", "a+") you get the same append behaviour and can also read from the file.
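To see the difference between the modes for yourself, here is a small sketch. The helper name write_twice is made up for this illustration, and it works on a throwaway temp file rather than your copy.txt:

```python
import os
import tempfile

def write_twice(mode):
    """Open a fresh temp file twice with `mode`, writing one line each time.

    write_twice is a hypothetical helper for this demo only.
    """
    path = os.path.join(tempfile.mkdtemp(), "copy.txt")
    for text in ("first\n", "second\n"):
        with open(path, mode) as f:  # the file does not exist on the first pass
            f.write(text)
    with open(path) as f:
        return f.read()

print(repr(write_twice("w")))   # only the last write survives
print(repr(write_twice("a")))   # both writes are kept
print(repr(write_twice("a+")))  # same as "a" as far as writing goes
```

So for collecting many documents into one file you either open the file once and keep it open for the whole loop, or reopen it in an append mode each time.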
To go through all files in a folder you can loop over them:
import os
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

if __name__ == '__main__':
    foldername = '/content/drive/My Drive/path/'  # folder name
    all_files = os.listdir(foldername)  # get all filenames
    # keep .docx files; lower() also matches names ending in .DOCX
    docx_files = [filename for filename in all_files if filename.lower().endswith('.docx')]
    file = open("copy.txt", "a+")
    for docx_file in docx_files:  # loop over .docx files
        # os.listdir returns bare names, so build the full path of the current file
        fullText = getText(os.path.join(foldername, docx_file))
        file.write(fullText + '\n')  # newline so documents do not run together
    file.close()
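Since you mentioned glob: below is a rough sketch of the same loop built on glob.glob. The names combine_files, read_text and pattern are my own and not part of python-docx; read_text is whatever function turns one path into a string, so you would pass your getText there. I am assuming the documents sit directly in the folder (no subfolders) and that alphabetical order is fine.

```python
import glob
import os

def combine_files(foldername, output_path, read_text, pattern="*.[dD][oO][cC][xX]"):
    """Write the text of every matching file in foldername into one output file.

    combine_files is a hypothetical helper; read_text is any callable
    that takes a file path and returns that file's text.
    """
    # glob.escape protects folder names containing characters such as '['
    search = os.path.join(glob.escape(foldername), pattern)
    paths = sorted(glob.glob(search))  # sorted for a reproducible order
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(read_text(path))
            out.write("\n")  # keep documents from running together
    return paths

# Possible usage with the getText function from above:
# combine_files('/content/drive/My Drive/path/', 'copy.txt', getText)
```

The output is opened once with "w" here, so running the script again rebuilds copy.txt instead of appending a second copy of every document. The returned list of paths might be handy for the ID step you mentioned.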
Upvotes: 1