Lehel Tompos
Lehel Tompos

Reputation: 9

How to split a dataframe dynamically

I have the following dataframe:

import pandas as pd
data = {'Name': ['Ankit', 'Amit', 'Aishwarya',
                 'Priyanka','Kovacs','Tompos'],
        'Age': [21, 19, 20, 18, 20, 19],
        'Stream': ['Math', 'Commerce', 'Arts',
                   'Biology', 'Math', 'Arts'],
        'Percentage': [88, 92, 95, 70, 80, 83]}
df = pd.DataFrame(data, columns=['Name', 'Age', 'Stream',
                                 'Percentage'])


    Name    Age Stream  Percentage
0   Ankit   21  Math    88
1   Amit    19  Commerce    92
2   Aishwarya   20  Arts    95
3   Priyanka    18  Biology 70
4   Kovacs  20  Math    80
5   Tompos  19  Arts    83

I would like to split the dataframe based on rows that contain Arts ins Stream column in the following way: the rows between the two rows that containing the Arts form a separate dataframe.

My code is the following:

first=df.iloc[0:2]
second=df.iloc[3:5]

But I would like to replace this with a code that takes into account if the index changes, applicable to larger dataframes and. Can someone help me with this?

Thanks!

Upvotes: 1

Views: 49

Answers (1)

mozway
mozway

Reputation: 262204

Set up a grouper with eq and cumsum, then groupby to split. Reverse the Series with [:::-1] to group the previous rows with each "Arts":

dfs = [g for _, g in df.groupby(df.loc[::-1, 'Stream'].eq('Arts').cumsum(),
                                sort=False)]

Output:

[        Name  Age    Stream  Percentage
 0      Ankit   21      Math          88
 1       Amit   19  Commerce          92
 2  Aishwarya   20      Arts          95,
        Name  Age   Stream  Percentage
 3  Priyanka   18  Biology          70
 4    Kovacs   20     Math          80
 5    Tompos   19     Arts          83]

Intermediates:

        Name  Age    Stream  Percentage  eq('Arts')  cumsum
0      Ankit   21      Math          88       False       2
1       Amit   19  Commerce          92       False       2
2  Aishwarya   20      Arts          95        True       2
3   Priyanka   18   Biology          70       False       1
4     Kovacs   20      Math          80       False       1
5     Tompos   19      Arts          83        True       1

And if you want to exclude the "Arts":

m = df.loc[::-1, 'Stream'].eq('Arts')
dfs = [g for _, g in df[~m].groupby(m.cumsum(), sort=False)]

Output:

[    Name  Age    Stream  Percentage
 0  Ankit   21      Math          88
 1   Amit   19  Commerce          92,
        Name  Age   Stream  Percentage
 3  Priyanka   18  Biology          70
 4    Kovacs   20     Math          80]

Upvotes: 1

Related Questions