Max Melichov

Reputation: 71

How to read the numbers at the start of a line from docx files in Python?

Example for a docx text:

enter image description here

After running this block of code:

import os
import docx2txt

dir_data = "some_dir"
for file in os.listdir(dir_data):
    with open(os.path.join(dir_data, file), 'rb') as infile:
        with open(file[:-5] + '.txt', 'w', encoding='utf-8') as outfile:
            doc = docx2txt.process(infile)
            outfile.write(doc)

It returns a txt file, but the text is missing the heading number before "Hello word" and the subclause number before "My name is max".

How can I fix it?

Upvotes: 1

Views: 325

Answers (3)

Max Melichov

Reputation: 71

Thanks, everybody! I found a solution using Aspose.Words:

import aspose.words as aw

# Load DOC file
doc = aw.Document(r"some_dir")

# Save DOC as TXT
doc.save("doc-to-text.txt")

Upvotes: 1

Ömer Sezer

Reputation: 2669

I tested this and it works on my setup.

Code:

import os
from docx import Document

dir_data = "C:\\Users\\<username>\\Desktop\\test"  # test directory on Desktop

for file in os.listdir(dir_data):
    if file.endswith(".docx"):
        docx_path = os.path.join(dir_data, file)
        txt_path = os.path.splitext(docx_path)[0] + '.txt'

        document = Document(docx_path)
        with open(txt_path, 'w', encoding='utf-8') as outfile:
            for i, paragraph in enumerate(document.paragraphs, start=1):
                content = paragraph.text.strip()
                if content:  # Skip empty paragraphs (their position is still counted)
                    outfile.write(f"{i}. {content}\n")

Output:

enter image description here
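
Note that the loop above numbers paragraphs by their position in the document; it does not read the numbering that Word applied. Word does not store automatic list numbers as text: a numbered paragraph only carries a `w:numPr` element (list id and level) in `word/document.xml`. As a minimal sketch using only the standard library, the numbers can be rebuilt by counting paragraphs per list and level. The function name `numbered_lines` is hypothetical, and the sketch assumes plain decimal numbering such as 1., 1.1.; it ignores the formats and start values defined in `numbering.xml` and any numbering inherited from paragraph styles.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def numbered_lines(docx_file):
    """Return paragraph texts, prefixing list paragraphs with a rebuilt number.

    Assumption: decimal numbering only; numbering.xml is not consulted.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    counters = defaultdict(lambda: defaultdict(int))  # list id -> level -> count
    lines = []
    for p in root.iter(W + "p"):
        text = "".join(t.text or "" for t in p.iter(W + "t")).strip()
        if not text:
            continue  # skip empty paragraphs
        num_pr = p.find(f"{W}pPr/{W}numPr")
        if num_pr is None:
            lines.append(text)  # ordinary paragraph, no list numbering
            continue
        ilvl = num_pr.find(W + "ilvl")
        num_id = num_pr.find(W + "numId")
        level = int(ilvl.get(W + "val")) if ilvl is not None else 0
        list_id = num_id.get(W + "val") if num_id is not None else "0"

        levels = counters[list_id]
        levels[level] += 1
        # A new item at this level restarts every deeper level.
        for deeper in [k for k in levels if k > level]:
            del levels[deeper]
        label = ".".join(str(levels.get(k, 1)) for k in range(level + 1))
        lines.append(f"{label}. {text}")
    return lines
```

Writing `"\n".join(numbered_lines(docx_path))` to the output file inside the loop would then replace the `enumerate` numbering. Documents with lettered or Roman-numeral lists would need the definitions from `numbering.xml` as well.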

Upvotes: 0

Guy

Reputation: 50949

You can do it with docx2python:

import os
from docx2python import docx2python

dir_data = "some_dir"
for file in os.listdir(dir_data):
    doc = docx2python(os.path.join(dir_data,file))
    with open('file.txt', 'a', encoding='utf-8') as outfile:
        outfile.write(doc.text.replace(')', '.'))
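
As written, the loop appends every document to the same `file.txt` and also tries to open any non-docx file in the directory. A small sketch of one way to get a separate output path per document is below; the helper name `docx_to_txt_paths` is hypothetical and not part of docx2python.

```python
import os


def docx_to_txt_paths(dir_data):
    """Yield (docx_path, txt_path) pairs for every .docx file in dir_data."""
    for name in sorted(os.listdir(dir_data)):
        # Skip other file types and Word's temporary lock files ("~$name.docx").
        if not name.lower().endswith(".docx") or name.startswith("~$"):
            continue
        docx_path = os.path.join(dir_data, name)
        yield docx_path, os.path.splitext(docx_path)[0] + ".txt"
```

Iterating with `for docx_path, txt_path in docx_to_txt_paths(dir_data):` and opening `txt_path` with mode `'w'` would then write one text file next to each source document.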

Upvotes: 0
