Reputation: 1828
With spire.doc, we can extract all the text from a file like this:
document = Document()
document.LoadFromFile(str(file))
document_text = document.GetText()
But doing this, we lose the chance to perform more fine-grained operations on the tables.
I am trying to convert each table into JSON format (so that all of its information is retained) and put that JSON back at the table's original position in the document.
The problem is that if I extract at the paragraph level and at the table level, the two seem to be stored in separate collections, and the position of each table relative to the paragraphs is lost.
How can we recover the position of the tables relative to the paragraphs (for example, that table A sits between the paragraphs with index 10 and 11)?
from spire.doc import *
from spire.doc.common import *
file_path = r'./file.docx'
document = Document()
document.LoadFromFile(file_path)
document.Sections.Count
section = document.Sections[0]
section.Paragraphs.Count
paragraph = section.Paragraphs[0]
section.Tables.Count
table = section.Tables[0]
Upvotes: 0
Views: 168
Reputation: 1003
The following code preserves the positions of tables during text extraction and also enables further operations to be performed on the table data.
from spire.doc import *
from spire.doc.common import *

def extract_table_data(table):
    table_data = []
    for r in range(table.Rows.Count):
        row_data = []
        for c in range(table.Rows[r].Cells.Count):
            cell_text = ""
            for n in range(table.Rows[r].Cells[c].Paragraphs.Count):
                cell_text += table.Rows[r].Cells[c].Paragraphs[n].Text.strip()
            row_data.append(cell_text)
        table_data.append(row_data)
    return table_data

def extract_paragraph_text(paragraph):
    return paragraph.Text

try:
    with Document() as doc:
        doc.LoadFromFile("test.docx")
        for i in range(doc.Sections.Count):
            body = doc.Sections[i].Body
            for j in range(body.ChildObjects.Count):
                doc_obj = body.ChildObjects[j]
                # Extract table data
                if isinstance(doc_obj, Table):
                    table_data = extract_table_data(doc_obj)
                    for row in table_data:
                        print(" | ".join(row))
                # Extract paragraph data
                if isinstance(doc_obj, Paragraph):
                    paragraph_text = extract_paragraph_text(doc_obj)
                    print(paragraph_text)
except Exception as e:
    print(f"Error: {e}")
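To get the JSON-in-place output the question asks for, one option is to collect the body objects into an ordered list instead of printing them, and then serialize that list. The sketch below is plain Python and does not call spire.doc itself. It assumes a hypothetical `blocks` list of `("paragraph", text)` and `("table", rows)` tuples, built in the same order as the `ChildObjects` loop above, where `rows` is what `extract_table_data` returns. The helper names `table_to_json`, `render_in_order`, and `table_positions` are illustrative, not part of the library, and treating the first row as a header is an assumption about the table layout.

```python
import json

def table_to_json(rows, has_header=True):
    """Serialize a table (a list of rows of cell strings) to a JSON string.

    Assumption: when has_header is True, the first row holds column names.
    Duplicate column names would overwrite each other, and rows longer
    than the header are truncated by zip().
    """
    if has_header and len(rows) > 1:
        header, *body = rows
        records = [dict(zip(header, row)) for row in body]
        return json.dumps(records, ensure_ascii=False)
    return json.dumps(rows, ensure_ascii=False)

def render_in_order(blocks):
    """Join paragraphs and JSON-encoded tables in document order."""
    lines = []
    for kind, payload in blocks:
        if kind == "table":
            lines.append(table_to_json(payload))
        else:
            lines.append(payload)
    return "\n".join(lines)

def table_positions(blocks):
    """Return, for each table, the (previous, next) paragraph indices.

    The previous index is -1 when the table precedes every paragraph.
    """
    positions = []
    para_index = -1
    for kind, _ in blocks:
        if kind == "paragraph":
            para_index += 1
        else:
            positions.append((para_index, para_index + 1))
    return positions

# Hypothetical data standing in for what the ChildObjects loop would collect.
blocks = [
    ("paragraph", "Intro"),
    ("table", [["Name", "Qty"], ["Apple", "3"]]),
    ("paragraph", "Outro"),
]
print(render_in_order(blocks))
print(table_positions(blocks))
```

In the loop above, this would mean appending `("table", extract_table_data(doc_obj))` or `("paragraph", doc_obj.Text)` to `blocks` rather than printing. `table_positions` then answers the "table A is between paragraphs 10 and 11" part of the question, counted within whatever `blocks` covers (one section or the whole document).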
Upvotes: 1