Reputation: 19
I have a xlsx file with multiples tables but for an analysis I need to work with just one table of them. How can I select a single table and work in it with pandas? In this example below I need to select and use the table (Modelo, ID, Tipo) as a dataframe.
Nomes ID Endereço
Pedro 1 Rua X
Joao 1 Rua Y
Maria 2 Rua X
Modelo ID Tipo
A 20 XYZ
B 30 XYS
C 40 XQZ
Cor Modelo Classe Local
Azul A 4 Casa
Vermelho A 4 Casa
Roxo A 5 Casa
Lilas A X Casa
I have read many documents but all says about working a single table in xlsx, my problem is to have many tables (around 18 tables) in a single xlsx file. Note: The tables will be the same headers always, but the items amount may be variable (it depends client per client)
Upvotes: 0
Views: 1101
Reputation: 5146
This is a very simple way. Since you need the second table and there are spaces in each table we can tackle this in steps:
1 - Read in Entire Dataframe
2 - Find the blank rows
3 - Read in between the blank rows
import pandas as pd
# read in dataframe
# using skip_blank_lines=False makes sure to read all the nulls
# using usecols ensures we only use the first three columns
df = pd.read_excel(r'path_to_your.xlsx'
, skip_blank_lines=False, usecols='A:C')
A B C
0 1 4 7
1 2 5 8
2 3 6 9
3 NaN NaN NaN
4 E F G
5 1 2 3
6 4 5 6
7 NaN NaN NaN
8 H I J
9 a d g
10 b e h
11 c f i
# get index of first entire NULL row, add 2 to correct skip these rows
first_val = df.loc[df.isna().all(axis=1)].index[0]+2
# re-read the file. Skiping the rows we just found the nulls for
df2 = pd.read_excel(r'path_to_your.xlsx', skiprows=first_val)
E F G
0 1 2 3
1 4 5 6
2 NaN NaN NaN
3 H I J
4 a d g
5 b e h
6 c f i
# now just read up to the blank line
# using iloc, we use the first row (0) and go up until the first blank line
# since the first blank line is the end of the dataframe it will return our answer
df2.iloc[0:df2.loc[df2.isna().all(axis=1)].index[0]]
E F G
0 1 2 3
1 4 5 6
Upvotes: 2
Reputation: 3527
The script below will create separate dataframes for each table in a single excel sheet:
import pandas as pd
import numpy as np
# read data from excel:
data = pd.read_excel('path/to/my/file.xlsx', header=None)
# split on rows with NaN (blank rows):
df_list = np.split(data, data[data.isnull().all(1)].index)
# iterate over each dataframe:
for df in df_list:
# remove NaNs:
df = df.dropna()
# set first row as column headers:
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
# clean up dataframe:
df = df.reset_index()
df = df.drop(['index'], axis=1)
df.columns.name = ''
# print results:
print(df)
print()
Will output:
Nomes ID Endereço
0 Pedro 1900-01-01 00:00:00 Rua X
1 Joao 1900-01-01 00:00:00 Rua Y
2 Maria 1900-01-02 00:00:00 Rua X
Modelo ID Tipo
0 A 20 XYZ
1 B 30 XYS
2 C 40 XQZ
Cor Modelo Classe Local
0 Azul A 4 Casa
1 Vermelho A 4 Casa
2 Roxo A 5 Casa
3 Lilas X Casa
Upvotes: 0