John

Reputation: 1828

With spire.doc in Python, how can I determine which two paragraphs a table is positioned between?

With spire.doc, we can extract all the text from a file like this:

document = Document()
document.LoadFromFile(str(file))
document_text = document.GetText()

But doing this, we lose the ability to perform finer-grained operations on the tables.

I am trying to convert each table into JSON format (so that all the information is retained) and put this JSON at the table's original position in the document.

The problem is that if I extract at the paragraph level and at the table level, the paragraphs and tables seem to be stored in two separate collections, and the position information of the tables is lost.

How can we recover the position of each table relative to these paragraphs (for example, table A is between the paragraphs with index 10 and 11)?

from spire.doc import *
from spire.doc.common import *

file_path = r'./file.docx'

document = Document()
document.LoadFromFile(file_path)

document.Sections.Count

section = document.Sections[0]

section.Paragraphs.Count

paragraph = section.Paragraphs[0]

section.Tables.Count

table = section.Tables[0]

Upvotes: 0

Views: 168

Answers (1)

Dheeraj Malik

Reputation: 1003

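The following code preserves the positions of tables during text extraction and also lets you perform further operations on the table data. It iterates over each section's Body.ChildObjects, which holds paragraphs and tables together in document order, instead of reading the separate Paragraphs and Tables collections.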

from spire.doc import *
from spire.doc.common import *

def extract_table_data(table):
    table_data = []
    for r in range(table.Rows.Count):
        row_data = []
        for c in range(table.Rows[r].Cells.Count):
            # Join the cell's paragraphs with a space so their text does not run together
            paragraphs = table.Rows[r].Cells[c].Paragraphs
            cell_text = " ".join(
                paragraphs[n].Text.strip() for n in range(paragraphs.Count)
            )
            row_data.append(cell_text)
        table_data.append(row_data)
    return table_data

def extract_paragraph_text(paragraph):
    return paragraph.Text

try:
    with Document() as doc:
        doc.LoadFromFile("test.docx")

        for i in range(doc.Sections.Count):
            body = doc.Sections[i].Body
            for j in range(body.ChildObjects.Count):
                doc_obj = body.ChildObjects[j]
                
                # Extract table data
                if isinstance(doc_obj, Table):
                    table_data = extract_table_data(doc_obj)
                    for row in table_data:
                        print(" | ".join(row))

                # Extract paragraph data
                if isinstance(doc_obj, Paragraph):
                    paragraph_text = extract_paragraph_text(doc_obj)
                    print(paragraph_text)

except Exception as e:
    print(f"Error: {e}")
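
To answer the "between which two paragraphs" part and the JSON replacement directly, the same traversal can keep a running paragraph counter. The sketch below is library-independent so that it can be run without spire.doc: it assumes you have already collected the body children, in order, into a list of ("paragraph", text) and ("table", rows) tuples, for example by appending inside the loop above instead of printing. The function name locate_tables and the tuple layout are hypothetical, not part of spire.doc, and the indices count body-level paragraphs in the order they are seen; whether they line up with section.Paragraphs indices in your document is an assumption you should check.

```python
import json


def locate_tables(items):
    """Return (parts, positions) for body items given in document order.

    items: list of ("paragraph", str) or ("table", list[list[str]]) tuples.
    parts: paragraph texts, with each table replaced in place by a JSON string.
    positions: one dict per table with the indices of the neighbouring
    paragraphs (None when there is no paragraph on that side).
    """
    parts = []
    positions = []
    para_index = -1  # index of the most recent paragraph seen so far

    for kind, payload in items:
        if kind == "paragraph":
            para_index += 1
            parts.append(payload)
        elif kind == "table":
            positions.append({
                "after_paragraph": para_index if para_index >= 0 else None,
                "before_paragraph": para_index + 1,
            })
            # Put the table back at its original position as JSON
            parts.append(json.dumps(payload, ensure_ascii=False))

    # Tables after the last paragraph have no following paragraph
    total_paragraphs = para_index + 1
    for pos in positions:
        if pos["before_paragraph"] >= total_paragraphs:
            pos["before_paragraph"] = None

    return parts, positions


# Hypothetical sample data standing in for the traversal output
items = [
    ("paragraph", "Intro"),
    ("table", [["Name", "Qty"], ["Apple", "3"]]),
    ("paragraph", "Outro"),
]
parts, positions = locate_tables(items)
print(positions)
print("\n".join(parts))
```

Here the single table is reported as sitting after paragraph 0 and before paragraph 1, and the joined text keeps the table's JSON between the two paragraphs. If the document has several sections, you would either run this per section or keep one counter across all of them, depending on which index you need.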


Upvotes: 1
