Xhoan
Xhoan

Reputation: 330

Split dataframe by certain values in first column?

I have a dataframe like this one:

A          C1   C2   Total 

PRODUCT1   8    11   19
rs1        5    9    14    
rs2        2    2    4 
rs3        1    0    1
PRODUCT2   21   12   33
rs7        11   7    18    
rs2        7    3    10
rs1        3    1    4
rs9        0    1    1
PRODUCT3   2    11   13
rs9        1    6    7    
rs5        1    5    6

The column A is made of strings, I want to split my dataframe by the values in this column, specifically every upper word in it. Like this:

df1 = 

PRODUCT1   8    11   19
rs1        5    9    14    
rs2        2    2    4 
rs3        1    0    1

df2 = 

PRODUCT2   21   12   33
rs7        11   7    18    
rs2        7    3    10
rs1        3    1    4
rs9        0    1    1

df3 = 

PRODUCT3   2    11   13
rs9        1    6    7    
rs5        1    5    6

Is there an easy way to achieve this?

Upvotes: 2

Views: 354

Answers (1)

unutbu
unutbu

Reputation: 880369

import pandas as pd
df = pd.DataFrame({'A': ['PRODUCT1', 'rs1', 'rs2', 'rs3', 'PRODUCT2', 'rs7', 'rs2', 'rs1', 'rs9', 'PRODUCT3', 'rs9', 'rs5'], 'C1': [8, 5, 2, 1, 21, 11, 7, 3, 0, 2, 1, 1], 'C2': [11, 9, 2, 0, 12, 7, 3, 1, 1, 11, 6, 5], 'Total': [19, 14, 4, 1, 33, 18, 10, 4, 1, 13, 7, 6]}) 

for key, group in df.groupby(df['A'].str.isupper().cumsum()):
    print(group)

prints

          A  C1  C2  Total
0  PRODUCT1   8  11     19
1       rs1   5   9     14
2       rs2   2   2      4
3       rs3   1   0      1
          A  C1  C2  Total
4  PRODUCT2  21  12     33
5       rs7  11   7     18
6       rs2   7   3     10
7       rs1   3   1      4
8       rs9   0   1      1
           A  C1  C2  Total
9   PRODUCT3   2  11     13
10       rs9   1   6      7
11       rs5   1   5      6

The idea here is to identify rows which are uppercase:

In [95]: df['A'].str.isupper()
Out[95]: 
0      True
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9      True
10    False
11    False
Name: A, dtype: bool

then use cumsum to take a cumulative sum, where True is treated as 1 and False is treated as 0:

In [96]: df['A'].str.isupper().cumsum()
Out[96]: 
0     1
1     1
2     1
3     1
4     2
5     2
6     2
7     2
8     2
9     3
10    3
11    3
Name: A, dtype: int64

These values can be used as group numbers. Pass them to df.groupby to group the DataFrame according to these group numbers. df.groupby(...) returns an iterable, which lets you loop through the sub-groups.

Upvotes: 4

Related Questions