Reputation: 1723
I have a problem when extracting the .docx file from Microsoft Word. I want to handle the data extracted by using python-docx library. When I want to scrap the file. I have a problem that a single paragraph turns into multiple paragraphs, but on my document file, I set it as a single paragraph with Bold text.
This is my script
import io
import re
import csv
import pandas as pd
# import win32com.client
import docx2txt
from docx import Document
industri_minuman = "test.docx"
for idx, para in enumerate(doc_industri_minuman.paragraphs):
for run in para.runs:
print(run.text)
This is the result from script
AGA, AMDK / INTAN MULIA, CV
Produk Utama: Air Mineral Aga
Tenaga Kerja: 20 - 99 Orang
Telepon: 82242000000
Email: -
Alamat Pabrik:
Ds
Gunungsari
Ds
Sumbergondo
,
Kec
Glenmore
Kab Banyuwangi
Expected Result:
AGA, AMDK / INTAN MULIA, CV
Produk Utama: Air Mineral Aga
Tenaga Kerja: 20 - 99 Orang
Telepon: 82242000000
Email: -
Alamat Pabrik: Ds Gunungsari Ds Sumbergondo,
Kec Glenmore Kab Banyuwangi
This is my attached of document file
Upvotes: 0
Views: 972
Reputation: 73
MS Word paragraphs are a little unpredictable, mainly because of the style. If you unzip a docx file and navigate to the content, we'll se a lot of tags like <w:r>, which is how MS Word handles the different styles in the doc.
A few days ago I develop a library that replaces text inside a document, and to do this I had to do some "magic" because of these breaks you mentioned. If you are interested, take a look here: https://github.com/ivanbicalho/python-docx-replace
Upvotes: 1