Lucas Cruz
Lucas Cruz

Reputation: 19

How to select a single table in a xlsx with multiples tables with Pandas

I have a xlsx file with multiples tables but for an analysis I need to work with just one table of them. How can I select a single table and work in it with pandas? In this example below I need to select and use the table (Modelo, ID, Tipo) as a dataframe.

Nomes ID Endereço
Pedro 1  Rua X
Joao  1  Rua Y
Maria 2  Rua X

Modelo ID Tipo
A      20 XYZ
B      30 XYS
C      40 XQZ

Cor Modelo Classe Local
Azul A        4     Casa
Vermelho A    4     Casa
Roxo A        5     Casa
Lilas A       X     Casa

I have read many documents but all says about working a single table in xlsx, my problem is to have many tables (around 18 tables) in a single xlsx file. Note: The tables will be the same headers always, but the items amount may be variable (it depends client per client)

Upvotes: 0

Views: 1101

Answers (2)

MattR
MattR

Reputation: 5146

This is a very simple way. Since you need the second table and there are spaces in each table we can tackle this in steps:

1 - Read in Entire Dataframe

2 - Find the blank rows

3 - Read in between the blank rows

import pandas as pd

# read in dataframe
# using skip_blank_lines=False makes sure to read all the nulls
# using usecols ensures we only use the first three columns
df = pd.read_excel(r'path_to_your.xlsx'
                   , skip_blank_lines=False, usecols='A:C')
      A    B    C
0     1    4    7
1     2    5    8
2     3    6    9
3   NaN  NaN  NaN
4     E    F    G
5     1    2    3
6     4    5    6
7   NaN  NaN  NaN
8     H    I    J
9     a    d    g
10    b    e    h
11    c    f    i
# get index of first entire NULL row, add 2 to correct skip these rows
first_val = df.loc[df.isna().all(axis=1)].index[0]+2

# re-read the file. Skiping the rows we just found the nulls for
df2 = pd.read_excel(r'path_to_your.xlsx', skiprows=first_val)

    E   F   G
0   1   2   3
1   4   5   6
2   NaN NaN NaN
3   H   I   J
4   a   d   g
5   b   e   h
6   c   f   i

# now just read up to the blank line
# using iloc, we use the first row (0) and go up until the first blank line
# since the first blank line is the end of the dataframe it will return our answer
df2.iloc[0:df2.loc[df2.isna().all(axis=1)].index[0]]

    E   F   G
0   1   2   3
1   4   5   6

Upvotes: 2

Daniel
Daniel

Reputation: 3527

The script below will create separate dataframes for each table in a single excel sheet:

import pandas as pd
import numpy as np    

# read data from excel:
data = pd.read_excel('path/to/my/file.xlsx', header=None)

# split on rows with NaN (blank rows):
df_list = np.split(data, data[data.isnull().all(1)].index)

# iterate over each dataframe:
for df in df_list:

    # remove NaNs:
    df = df.dropna()

    # set first row as column headers:
    new_header = df.iloc[0] 
    df = df[1:] 
    df.columns = new_header 

    # clean up dataframe:
    df = df.reset_index()
    df = df.drop(['index'], axis=1)
    df.columns.name = ''

    # print results:
    print(df)
    print()

Will output:

   Nomes                   ID Endereço
0  Pedro  1900-01-01 00:00:00    Rua X
1   Joao  1900-01-01 00:00:00    Rua Y
2  Maria  1900-01-02 00:00:00    Rua X

  Modelo  ID Tipo
0      A  20  XYZ
1      B  30  XYS
2      C  40  XQZ

   Cor Modelo Classe Local
0      Azul A      4  Casa
1  Vermelho A      4  Casa
2      Roxo A      5  Casa
3      Lilas       X  Casa

Upvotes: 0

Related Questions