Reputation: 109
I'm trying to build a new dataframe from grouping part of the string into a column.
import pandas
df = pandas.DataFrame([{'A': 'string_300_bla1', 'B': "Hi", 'C': 3},
{'A': 'string_300_blaa2', 'B': "Hello", 'C': 4},
{'A': 'string_487_blaaa1', 'B': "nice", 'C': 9},
{'A': 'string_487_blaaa2', 'B': "day", 'C': 6}])
I want to make a groupby from this part of the string
string_300_bla1
I tried:
import re
dfs = df['A'].str.contains('.*_\d+_.*', re.IGNORECASE).groupby(df['B'])
My output:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000279EFD009E8>
Good output:
dfs = pandas.DataFrame([{'A': 'string_300', 'B': "Hi\n\nHello"},
{'A': 'string_487', 'B': "nice\n\nday"}])
Upvotes: 5
Views: 103
Reputation: 150805
We can do:
(df.groupby(df.A.str.extract('(\w+_\d+)')[0])
.agg({'B':'\n\n'.join, 'C':'sum'})
.reset_index()
)
Output:
0 B C
0 string_300 Hi\n\nHello 7
1 string_487 nice\n\nday 15
As pointed out by @CharlesGleason, here's the solution that extract the digit parts:
(df.groupby(df.A.str.extract('\w+_(\d+)')[0])
.agg({'A':'first', 'B':'\n\n'.join, 'C':'sum'})
.reset_index(drop=True)
)
Upvotes: 5
Reputation: 323366
You can use str.rsplit
df.B.groupby(df.A.str.rsplit('_',n=1).str[0]).agg('\n\n'.join).reset_index()
Out[236]:
A B
0 string_300 Hi\n\nHello
1 string_487 nice\n\nday
Upvotes: 4