Reputation: 71
Example of the text in a docx file:
After running this block of code:
import os
import docx2txt

dir_data = "some_dir"
for file in os.listdir(dir_data):
    with open(os.path.join(dir_data, file), 'rb') as infile:
        # file[:-5] strips the ".docx" extension
        with open(file[:-5] + '.txt', 'w', encoding='utf-8') as outfile:
            doc = docx2txt.process(infile)
            outfile.write(doc)
It returns a txt file, but the text is missing the heading number before "Hello word" and the subclause number before "My name is max".
How can I fix this?
Upvotes: 1
Views: 325
Reputation: 71
Thanks, everybody! I found a solution:
import aspose.words as aw
# Load DOC file
doc = aw.Document(r"some_dir")
# Save DOC as TXT
doc.save("doc-to-text.txt")
Upvotes: 1
Reputation: 2669
I tested this and it works on my setup.
Code:
import os
from docx import Document

dir_data = "C:\\Users\\<username>\\Desktop\\test"  # test directory on Desktop
for file in os.listdir(dir_data):
    if file.endswith(".docx"):
        docx_path = os.path.join(dir_data, file)
        txt_path = os.path.splitext(docx_path)[0] + '.txt'
        document = Document(docx_path)
        with open(txt_path, 'w', encoding='utf-8') as outfile:
            for i, paragraph in enumerate(document.paragraphs, start=1):
                stripped = paragraph.text.strip()
                if stripped:  # Ignore empty paragraphs (they still consume a number)
                    outfile.write(f"{i}. {stripped}\n")
Output:
Upvotes: 0
Reputation: 50949
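You can do it with docx2python:
import os
from docx2python import docx2python

dir_data = "some_dir"
for file in os.listdir(dir_data):
    doc = docx2python(os.path.join(dir_data, file))
    # Append each document's text to a single output file
    with open('file.txt', 'a', encoding='utf-8') as outfile:
        # Note: this replaces every ')' in the text, not only list markers
        outfile.write(doc.text.replace(')', '.'))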
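Because the `replace(')', '.')` call above touches every closing parenthesis, ordinary text such as "(word)" would also be altered. A minimal sketch of a narrower replacement follows, assuming the list markers appear at the start of a line in the form `1)` or `a)`, optionally indented with tabs or spaces. The helper name `normalize_list_markers` and the marker pattern are hypothetical, not part of docx2python.

```python
import re

# Assumption: a list marker is a short run of letters/digits followed by ")"
# at the start of a line, possibly after indentation (tabs or spaces).
MARKER = re.compile(r"^([ \t]*)(\w+)\)", flags=re.MULTILINE)


def normalize_list_markers(text: str) -> str:
    """Turn leading markers like '1)' or 'a)' into '1.' or 'a.'.

    Parentheses elsewhere in the line are left untouched.
    """
    return MARKER.sub(r"\1\2.", text)


sample = "1)\tHello (word)\n\ta)\tMy name is max"
print(normalize_list_markers(sample))
```

In the loop above, this could be used as `outfile.write(normalize_list_markers(doc.text))`. A line of ordinary text that happens to begin with a word followed by ")" would still be rewritten, so the pattern may need tightening for some documents.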
Upvotes: 0