Reputation: 1631
I would like to extract multiple tables of a series of excel spreadsheets where some sheets may contain more than one table, to store the tables separately as e.g. csv files. The table could be something like this:
if I read it using pandas read_excel
import pandas as pd
pd.read_excel('table_example.xlsx',header=None)
I would get something like this:
how could I extract the different tables? In my case tables have NaN values, which is potentially an additional complication.
[EDIT1] something similar to the excel tables can be generated using pandas:
df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)
I have tried to find the limits of the tables using built-in pandas functions:
df[df.isnull().all(axis=1)]
I could use the first and second row to set an horizontal division and maybe make a first split but I don't know how to select the cells above or below the identified rows. Or even if this is the most convenient approach.
disclaimer: in my case the tables always have a tag in the row above the header, this is because these tables are read by a non-python software which uses them to identify where the table starts. I decided to not consider these tags to ask a more generic problem, that other people may encounter.
Upvotes: 5
Views: 11689
Reputation: 51
This may help to dynamically locate and extract the table as long as the 2 tables are separated by either a row or column of NaNs.
I used the boundingbox solution from https://stackoverflow.com/a/54675526
from skimage.measure import label, regionprops
#this basically converts your table into 0s and 1s where 0 is NaN and 1 for non NaN
binary_rep = np.array(df.notnull().astype('int'))
list_of_dataframes = []
l = label(binary_rep)
for s in regionprops(l):
#the bbox contains the extremes of the bounding box. So the top left and bottom right cell locations of the table.
list_of_dataframes.append(df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]])
Upvotes: 5
Reputation: 856
import numpy as np
import pandas as pd
# I have assumed that the tables are "separated" by at least one row with only NaN values
df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)
print(df)
# Extract by rows
nul_rows = list(df[df.isnull().all(axis=1)].index)
list_of_dataframes = []
for i in range(len(nul_rows) - 1):
list_of_dataframes.append(df.iloc[nul_rows[i]+1:nul_rows[i+1],:])
# Remove null columns
cleaned_tables = []
for _df in list_of_dataframes:
cleaned_tables.append(_df.dropna(axis=1, how='all'))
# cleaned_tables is a list of the dataframes
print(cleaned_tables[0])
print(cleaned_tables[1])
Upvotes: 4