How to select a single table in a xlsx with multiples tables with Pandas

Question

I have a xlsx file with multiples tables but for an analysis I need to work with just one table of them. How can I select a single table and work in it with pandas? In this example below I need to select and use the table (Modelo, ID, Tipo) as a dataframe.

Nomes ID Endereço
Pedro 1  Rua X
Joao  1  Rua Y
Maria 2  Rua X

Modelo ID Tipo
A      20 XYZ
B      30 XYS
C      40 XQZ

Cor Modelo Classe Local
Azul A        4     Casa
Vermelho A    4     Casa
Roxo A        5     Casa
Lilas A       X     Casa

I have read many documents but all says about working a single table in xlsx, my problem is to have many tables (around 18 tables) in a single xlsx file. Note: The tables will be the same headers always, but the items amount may be variable (it depends client per client)

MattR · Accepted Answer

This is a very simple way. Since you need the second table and there are spaces in each table we can tackle this in steps:

1 - Read in Entire Dataframe

2 - Find the blank rows

3 - Read in between the blank rows

import pandas as pd

# read in dataframe
# using skip_blank_lines=False makes sure to read all the nulls
# using usecols ensures we only use the first three columns
df = pd.read_excel(r'path_to_your.xlsx'
                   , skip_blank_lines=False, usecols='A:C')
      A    B    C
0     1    4    7
1     2    5    8
2     3    6    9
3   NaN  NaN  NaN
4     E    F    G
5     1    2    3
6     4    5    6
7   NaN  NaN  NaN
8     H    I    J
9     a    d    g
10    b    e    h
11    c    f    i
# get index of first entire NULL row, add 2 to correct skip these rows
first_val = df.loc[df.isna().all(axis=1)].index[0]+2

# re-read the file. Skiping the rows we just found the nulls for
df2 = pd.read_excel(r'path_to_your.xlsx', skiprows=first_val)

    E   F   G
0   1   2   3
1   4   5   6
2   NaN NaN NaN
3   H   I   J
4   a   d   g
5   b   e   h
6   c   f   i

# now just read up to the blank line
# using iloc, we use the first row (0) and go up until the first blank line
# since the first blank line is the end of the dataframe it will return our answer
df2.iloc[0:df2.loc[df2.isna().all(axis=1)].index[0]]

    E   F   G
0   1   2   3
1   4   5   6

How to select a single table in a xlsx with multiples tables with Pandas

Answers (2)

Related Questions