MADFROST
MADFROST

Reputation: 1723

How to handle new line to a single paragraph python docx

I have a problem when extracting the .docx file from Microsoft Word. I want to handle the data extracted by using python-docx library. When I want to scrap the file. I have a problem that a single paragraph turns into multiple paragraphs, but on my document file, I set it as a single paragraph with Bold text.

This is my script

import io
import re
import csv
import pandas as pd
# import win32com.client
import docx2txt
from docx import Document

industri_minuman = "test.docx"

for idx, para in enumerate(doc_industri_minuman.paragraphs):
    for run in para.runs:
        print(run.text)

This is the result from script

AGA, AMDK / INTAN MULIA, CV
Produk Utama: Air Mineral Aga
Tenaga Kerja: 20 - 99 Orang
Telepon: 82242000000
Email: -
Alamat Pabrik: 
Ds
 
Gunungsari
 
Ds
 
Sumbergondo
,
Kec
 
Glenmore
 Kab Banyuwangi

Expected Result:

AGA, AMDK / INTAN MULIA, CV
Produk Utama: Air Mineral Aga
Tenaga Kerja: 20 - 99 Orang
Telepon: 82242000000
Email: -
Alamat Pabrik: Ds Gunungsari Ds Sumbergondo,
Kec Glenmore Kab Banyuwangi

This is my attached of document file

Link to Docs File

Upvotes: 0

Views: 972

Answers (1)

Ivan Bicalho
Ivan Bicalho

Reputation: 73

MS Word paragraphs are a little unpredictable, mainly because of the style. If you unzip a docx file and navigate to the content, we'll se a lot of tags like <w:r>, which is how MS Word handles the different styles in the doc.

A few days ago I develop a library that replaces text inside a document, and to do this I had to do some "magic" because of these breaks you mentioned. If you are interested, take a look here: https://github.com/ivanbicalho/python-docx-replace

Upvotes: 1

Related Questions