mfastudillo
mfastudillo

Reputation: 1631

extract multiple tables from spreadsheet using python

I would like to extract multiple tables of a series of excel spreadsheets where some sheets may contain more than one table, to store the tables separately as e.g. csv files. The table could be something like this:

enter image description here

if I read it using pandas read_excel

import pandas as pd
pd.read_excel('table_example.xlsx',header=None)

I would get something like this:

enter image description here

how could I extract the different tables? In my case tables have NaN values, which is potentially an additional complication.

[EDIT1] something similar to the excel tables can be generated using pandas:

df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

I have tried to find the limits of the tables using built-in pandas functions:

df[df.isnull().all(axis=1)]

I could use the first and second row to set an horizontal division and maybe make a first split but I don't know how to select the cells above or below the identified rows. Or even if this is the most convenient approach.

disclaimer: in my case the tables always have a tag in the row above the header, this is because these tables are read by a non-python software which uses them to identify where the table starts. I decided to not consider these tags to ask a more generic problem, that other people may encounter.

Upvotes: 5

Views: 11689

Answers (2)

Roshan
Roshan

Reputation: 51

This may help to dynamically locate and extract the table as long as the 2 tables are separated by either a row or column of NaNs.

I used the boundingbox solution from https://stackoverflow.com/a/54675526

from skimage.measure import label, regionprops
#this basically converts your table into 0s and 1s where 0 is NaN and 1 for non NaN 
binary_rep = np.array(df.notnull().astype('int'))

list_of_dataframes = []
l = label(binary_rep)
for s in regionprops(l):
    #the bbox contains the extremes of the bounding box. So the top left and bottom right cell locations of the table.
    list_of_dataframes.append(df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]])

Upvotes: 5

gxpr
gxpr

Reputation: 856

import numpy as np
import pandas as pd

# I have assumed that the tables are "separated" by at least one row with only NaN values

df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

print(df)

# Extract by rows

nul_rows = list(df[df.isnull().all(axis=1)].index)

list_of_dataframes = []
for i in range(len(nul_rows) - 1):
    list_of_dataframes.append(df.iloc[nul_rows[i]+1:nul_rows[i+1],:])


# Remove null columns

cleaned_tables = []
for _df in list_of_dataframes:
    cleaned_tables.append(_df.dropna(axis=1, how='all'))

# cleaned_tables is a list of the dataframes

print(cleaned_tables[0])
print(cleaned_tables[1])

Upvotes: 4

Related Questions