Stefan Urziceanu

Reputation: 245

How do I extract data from a doc/docx file using Python

I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data in MS Word files and save it in an XML file. Reading up on python-docx did not help, as it only seems to allow writing to Word documents rather than reading them. To present my task exactly (or at least how I chose to approach it): I would like to search for a keyword or phrase in the document (the document contains tables) and extract the text data from the table where the keyword or phrase is found. Does anybody have any ideas?

Upvotes: 11

Views: 55950

Answers (7)

Omer Hayun

Reputation: 1

I built a new docx parsing and conversion library for working with docx files. The parsing is mostly used for conversion to WYSIWYG HTML and txt formats, but it is also useful for extracting information from a docx in a more Pythonic way.

To answer this question specifically, this is how you would use the library:

from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
from docx_parser_converter.docx_parsers.document.document_parser import DocumentParser
from docx_parser_converter.docx_parsers.models.table_models import Table
from docx_parser_converter.docx_parsers.models.paragraph_models import Paragraph
import json

def search_keyword_in_table(docx_file_content, keyword):
    parser = DocumentParser(docx_file_content)
    doc_schema = parser.get_document_schema()
    
    for element in doc_schema.elements:
        if isinstance(element, Table):
            for row in element.rows:
                for cell in row.cells:
                    for paragraph in cell:
                        if isinstance(paragraph, Paragraph):
                            for run in paragraph.runs:
                                if keyword.lower() in run.text.lower():
                                    return element
    return None

docx_path = "path/to/your/docx/file"
keyword = "your_keyword_or_phrase_here"

docx_file_content = read_binary_from_file_path(docx_path)
table_with_keyword = search_keyword_in_table(docx_file_content, keyword)

if table_with_keyword:
    filtered_schema_dict = table_with_keyword.model_dump(exclude_none=True)
    print(json.dumps(filtered_schema_dict, indent=2))
else:
    print("Keyword not found in any table.")

I might add a more straightforward search function in the future if people ask for it.

Upvotes: 0

Alexey Noskov

Reputation: 1997

You can use Aspose.Words to read the document. When a document is loaded into an Aspose.Words Document object, it is represented as a DOM, which you can read programmatically.

For example, the following code reads the first table in the document and checks the first cell of each row as a key:

import aspose.words as aw

key = "Name"

doc = aw.Document("C:\\Temp\\in.docx")
table = doc.first_section.body.tables[0]
for r in table.rows:
    row = r.as_row()
    first_cell_text = row.first_cell.to_string(aw.SaveFormat.TEXT).strip()
    if first_cell_text == key:
        print(row.cells[1].to_string(aw.SaveFormat.TEXT).strip())

Upvotes: 0

dataninsight

Reputation: 1343

Extracting text from a doc/docx file using Python:

import os
import docx2txt
from win32com import client as wc  # needs Windows with MS Word installed

def extract_text_from_docx(path):
    # docx2txt returns the whole document text as one string
    temp = docx2txt.process(path)
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    final_text = ' '.join(text)
    return final_text

def extract_text_from_doc(doc_path):
    # Let Word convert the legacy .doc to .docx (file format 16), then read that.
    # Word expects absolute paths, so build the full target path first.
    joined_path = os.path.join(root_path, save_file_name)
    w = wc.Dispatch('Word.Application')
    doc = w.Documents.Open(doc_path)
    doc.SaveAs(joined_path, 16)
    doc.Close()
    w.Quit()
    text = extract_text_from_docx(joined_path)
    return text

def extract_text(file_path, extension):
    text = ''
    if extension == '.docx':
        text = extract_text_from_docx(file_path)
    elif extension == '.doc':
        text = extract_text_from_doc(file_path)
    return text

file_path = '<path to the doc/docx file>'
root_path = '<directory where the converted docx is saved>'
save_file_name = "Final2_text_docx.docx"
extension = os.path.splitext(file_path)[1].lower()
final_text = extract_text(file_path, extension)
print(final_text)

Upvotes: 1

Mike Robins

Reputation: 1773

A docx file is a zip archive containing the document as XML. You can open the zip, read the document and parse the data using ElementTree.

The advantage of this technique is that you don't need any extra Python libraries installed.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            print(''.join(node.text or '' for node in cell.iter(TEXT)))

See my stackoverflow answer to How to read contents of an Table in MS-Word file Using Python? for more details and references.
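
Building on the same zip plus ElementTree approach, here is a minimal sketch of the task in the question: find the first table that contains a keyword and serialise it as XML. The function names and the output element names (table, row, cell) are made up for this illustration, and nested tables are not handled separately (their cells are folded into the outer table). The demo runs against a tiny in-memory archive that holds only word/document.xml, which is enough for this parser but is not a complete docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
W = '{' + NS + '}'


def cell_text(cell):
    # Join every text node inside the cell; node.text can be None when empty
    return ''.join(node.text or '' for node in cell.iter(W + 't'))


def find_table_rows(docx_file, keyword):
    """Return the first table containing keyword as a list of rows of cell strings."""
    with zipfile.ZipFile(docx_file) as docx:
        tree = ET.fromstring(docx.read('word/document.xml'))
    for table in tree.iter(W + 'tbl'):
        rows = [[cell_text(cell) for cell in row.iter(W + 'tc')]
                for row in table.iter(W + 'tr')]
        if any(keyword.lower() in text.lower() for row in rows for text in row):
            return rows
    return None


def rows_to_xml(rows):
    """Serialise rows as XML; the element names are hypothetical, pick your own."""
    root = ET.Element('table')
    for row in rows:
        row_el = ET.SubElement(root, 'row')
        for text in row:
            ET.SubElement(row_el, 'cell').text = text
    return ET.tostring(root, encoding='unicode')


# Demo: build a minimal in-memory archive with one 1x2 table
document_xml = (
    '<w:document xmlns:w="' + NS + '"><w:body><w:tbl>'
    '<w:tr><w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:p><w:r><w:t>Alice</w:t></w:r></w:p></w:tc></w:tr>'
    '</w:tbl></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', document_xml)

rows = find_table_rows(buffer, 'name')
xml_output = rows_to_xml(rows)
print(xml_output)
```

For a real file, pass the path of the docx instead of the buffer and write the returned string to an .xml file.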

In answer to a comment below: images are not as clear-cut to extract. I created an empty docx and inserted one image into it. I then opened the docx file as a zip archive (using 7zip) and looked at document.xml. All the image information is stored as attributes in the XML, not as element text like the document text is. So you need to find the tag you are interested in and pull out the information that you are looking for.

For example adding to the script above:

IMAGE = '{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}' + 'docPr'

for image in tree.iter(IMAGE):
    print(image.attrib)

outputs:

{'id': '1', 'name': 'Picture 1'}

I'm no expert on the Open XML format, but I hope this helps.

I do note that the zip file contains a directory called media which contains a file called image1.jpeg that contains a renamed copy of my embedded image. You can look around in the docx zipfile to investigate what is available.
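
As a rough sketch of that kind of look-around, the snippet below lists the embedded files and reads one back as raw bytes. It assumes the media directory sits at word/media/ inside the archive, which is where Word normally puts it. The function names are invented for this example, and the demo archive is a stand-in built in memory rather than a real document.

```python
import io
import zipfile


def list_media(docx_file):
    """Return the names of the files stored under word/media/ in the archive."""
    with zipfile.ZipFile(docx_file) as docx:
        return [name for name in docx.namelist()
                if name.startswith('word/media/')]


def read_media(docx_file, name):
    """Return the raw bytes of one embedded file, e.g. 'word/media/image1.jpeg'."""
    with zipfile.ZipFile(docx_file) as docx:
        return docx.read(name)


# Demo: a stand-in archive with placeholder bytes instead of a real image
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', '<doc/>')
    archive.writestr('word/media/image1.jpeg', b'not a real jpeg')

media_names = list_media(buffer)
print(media_names)
```

Writing the bytes returned by read_media to a file on disk gives you the image itself.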

Upvotes: 18

Krissh

Reputation: 357

A simpler library with image extraction capability.

pip install docx2txt

Then use the code below to read the docx file.

import docx2txt
text = docx2txt.process("file.docx")

Upvotes: 0

Stefan Urziceanu

Reputation: 245

It seems that pywin32 does the trick. You can iterate through all the tables in a document and through all the cells inside a table. It's a bit tricky to get the data (the last 2 characters of every entry have to be omitted), but otherwise it's ten minutes of coding. If anyone needs additional details, please say so in the comments.
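
This answer does not include code, so the following is only a sketch of what such a loop might look like, not the code that was actually used. It assumes Windows with MS Word and pywin32 installed, and that the two trailing characters are Word's end-of-cell marker (a carriage return followed by a BEL character). The names clean_cell_text and read_tables are hypothetical.

```python
def clean_cell_text(raw):
    """Drop the end-of-cell marker that Word appends to a cell's Range.Text."""
    # Assumption: the marker is '\r\x07', i.e. the "last 2 characters"
    return raw[:-2] if raw.endswith('\r\x07') else raw


def read_tables(doc_path):
    """Return every table as a flat list of cell strings (Windows + Word only)."""
    # Imported lazily so clean_cell_text stays usable without pywin32
    import win32com.client

    word = win32com.client.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(doc_path)  # Word expects an absolute path
        try:
            return [
                [clean_cell_text(cell.Range.Text) for cell in table.Range.Cells]
                for table in doc.Tables
            ]
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()


print(repr(clean_cell_text('Name\r\x07')))
```

Searching for a keyword is then a matter of checking each cleaned string in the lists returned by read_tables.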

Upvotes: -1

edi9999

Reputation: 20574

To search in a document with python-docx:

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')

# search() returns True if the string is found
search(document,'your search string')

You also have a function to get the text of a document:

https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')
fullText = getdocumenttext(document)

Using https://github.com/mikemaccana/python-docx
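
The functions above come from the legacy module in that repository. As a hedged alternative, the sketch below assumes the current python-docx package, where docx.Document(path) returns an object exposing tables, rows, cells and cell.text. The helper only relies on those attributes, so the demo uses stand-in objects instead of a real file. tables_containing and fake_table are names made up for this example.

```python
from types import SimpleNamespace


def tables_containing(document, keyword):
    """Return each table that mentions keyword as a list of rows of cell strings."""
    keyword = keyword.lower()
    matches = []
    for table in document.tables:
        rows = [[cell.text for cell in row.cells] for row in table.rows]
        if any(keyword in text.lower() for row in rows for text in row):
            matches.append(rows)
    return matches


def fake_table(*rows):
    # Stand-in with the same attribute shape as a python-docx table (demo only)
    return SimpleNamespace(rows=[
        SimpleNamespace(cells=[SimpleNamespace(text=text) for text in row])
        for row in rows
    ])


fake_document = SimpleNamespace(tables=[
    fake_table(['Name', 'Alice']),
    fake_table(['Total', '42']),
])

found = tables_containing(fake_document, 'name')
print(found)
```

With the package installed, the same helper should work on a real file by passing docx.Document('A document.docx') as the first argument.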

Upvotes: 5
