Lwgs
Lwgs

Reputation: 3

Extracting tables from Word into Excel- how to keep the tables separate?

Trying to export tables from within .docx (of different sizes in terms of both number of columns and rows) into Excel

I can get the document to transfer the tables into an excel, but it joins the two tables together. Is there a way I can get the tables to either be separated on the same sheet, or on different worksheets (either is fine)?

Code as below:

pip install python-docx
import pandas as pd

from docx import Document
path = (r"PATH\Practice_Tables.docx")

df = pd.DataFrame()

doc = Document(path)
for table in doc.tables:
    for row in table.rows:
        row_text = [c.text for c in row.cells]
        df = df.append([row_text], ignore_index=True)
        
print(df)

                  0                     1               2
0    Table header 1        Table header 2  Table header 3
1              r1c1                  r1c2            r1c3
2              r2c1                  r2c2            r2c3
3  Practice Table 2  Practice table col 2             NaN
4             Row 2                Row 2a             NaN

writer = pd.ExcelWriter('PracticeTables3.xlsx', engine='xlsxwriter')

df.to_excel(writer, sheet_name='Sheet1')

writer.save()

Update

Error from single sheet:

NameError                                 Traceback (most recent call last)
<ipython-input-7-d2e9fa27f104> in <module>
      9     print(df)
     10 
---> 11     df.to_excel(writer, sheet_name='Sheet1', startrow=startrow)
     12     startrow += len(df)+2
     13 

NameError: name 'startrow' is not defined

Upvotes: 0

Views: 384

Answers (1)

norie
norie

Reputation: 9857

You're current code is adding all the tables into one dataframe.

You need to create a separate dataframe for each table, write it to the Excel file and move on.

Separate sheets

This will write each table to a separate sheet in an Excel file.

import pandas as pd
from docx import Document

path = 'Practice_Tables.docx'

sheet_no=0

writer = pd.ExcelWriter('PracticeTables3.xlsx', engine='xlsxwriter')
doc = Document(path)
for table in doc.tables:
    df = pd.DataFrame()
    for row in table.rows:
        row_text = [c.text for c in row.cells]
        df = df.append([row_text], ignore_index=True)
    
    print(df)
    sheet_no += 1 
    df.to_excel(writer, sheet_name=f'Sheet{sheet_no}')

writer.save()

Single sheet

This code will write the tables to the same sheet with rows inbetween.

import pandas as pd
from docx import Document

path = 'Practice_Tables.docx'

startrow=0

writer = pd.ExcelWriter('PracticeTables3.xlsx', engine='xlsxwriter')
doc = Document(path)
for table in doc.tables:
    df = pd.DataFrame()
    for row in table.rows:
        row_text = [c.text for c in row.cells]
        df = df.append([row_text], ignore_index=True)
    
    print(df)
   
    df.to_excel(writer, sheet_name='Sheet1', startrow=startrow)
    startrow += len(df)+2
    
writer.save()

Upvotes: 1

Related Questions