extract multiple tables from spreadsheet using python

Question

I would like to extract multiple tables of a series of excel spreadsheets where some sheets may contain more than one table, to store the tables separately as e.g. csv files. The table could be something like this:

if I read it using pandas read_excel

import pandas as pd
pd.read_excel('table_example.xlsx',header=None)

I would get something like this:

how could I extract the different tables? In my case tables have NaN values, which is potentially an additional complication.

[EDIT1] something similar to the excel tables can be generated using pandas:

df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

I have tried to find the limits of the tables using built-in pandas functions:

df[df.isnull().all(axis=1)]

I could use the first and second row to set an horizontal division and maybe make a first split but I don't know how to select the cells above or below the identified rows. Or even if this is the most convenient approach.

disclaimer: in my case the tables always have a tag in the row above the header, this is because these tables are read by a non-python software which uses them to identify where the table starts. I decided to not consider these tags to ask a more generic problem, that other people may encounter.

gxpr · Accepted Answer

import numpy as np
import pandas as pd

# I have assumed that the tables are "separated" by at least one row with only NaN values

df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

print(df)

# Extract by rows

nul_rows = list(df[df.isnull().all(axis=1)].index)

list_of_dataframes = []
for i in range(len(nul_rows) - 1):
    list_of_dataframes.append(df.iloc[nul_rows[i]+1:nul_rows[i+1],:])


# Remove null columns

cleaned_tables = []
for _df in list_of_dataframes:
    cleaned_tables.append(_df.dropna(axis=1, how='all'))

# cleaned_tables is a list of the dataframes

print(cleaned_tables[0])
print(cleaned_tables[1])

extract multiple tables from spreadsheet using python

Answers (2)

Related Questions