ladybug
ladybug

Reputation: 109

Groupby a part of the string in pandas

I'm trying to build a new dataframe from grouping part of the string into a column.

import pandas

df = pandas.DataFrame([{'A': 'string_300_bla1', 'B': "Hi", 'C': 3},
                       {'A': 'string_300_blaa2', 'B': "Hello", 'C': 4},
                       {'A': 'string_487_blaaa1', 'B': "nice", 'C': 9},
                       {'A': 'string_487_blaaa2', 'B': "day", 'C': 6}])

I want to make a groupby from this part of the string

string_300_bla1

I tried:

import re

dfs = df['A'].str.contains('.*_\d+_.*', re.IGNORECASE).groupby(df['B'])

My output:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000279EFD009E8>

Good output:

dfs = pandas.DataFrame([{'A': 'string_300', 'B': "Hi\n\nHello"},
                       {'A': 'string_487', 'B': "nice\n\nday"}])

Upvotes: 5

Views: 103

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150805

We can do:

(df.groupby(df.A.str.extract('(\w+_\d+)')[0])
   .agg({'B':'\n\n'.join, 'C':'sum'})
   .reset_index()
)

Output:

            0            B   C
0  string_300  Hi\n\nHello   7
1  string_487  nice\n\nday  15

As pointed out by @CharlesGleason, here's the solution that extract the digit parts:

(df.groupby(df.A.str.extract('\w+_(\d+)')[0])
   .agg({'A':'first', 'B':'\n\n'.join, 'C':'sum'})
   .reset_index(drop=True)
)

Upvotes: 5

BENY
BENY

Reputation: 323366

You can use str.rsplit

df.B.groupby(df.A.str.rsplit('_',n=1).str[0]).agg('\n\n'.join).reset_index()
Out[236]: 
            A         B
0  string_300  Hi\n\nHello
1  string_487  nice\n\nday

Upvotes: 4

Related Questions