Reputation:
I have a DataFrame as follows:
import pandas as pd

data = {
'values': ['abc and sea', 'def work', 'abc', 'ram cold', 'myran add'],
'gems': ['A1, A2, A3, A4', 'B1, A1, B2, B3', 'C1, A1', 'KS, KM, A1, A3', 'A2, A1, JP, CVK'],
'Name': ['Drama', 'Drama', 'Drama', 'Ant and Boat', 'Ant and Boat']
}
df = pd.DataFrame(data)
df
values gems Name
0 abc and sea A1, A2, A3, A4 Drama
1 def work B1, A1, B2, B3 Drama
2 abc C1, A1 Drama
3 ram cold KS, KM, A1, A3 Ant and Boat
4 myran add A2, A1, JP, CVK Ant and Boat
I want to count the unique values of df['gems'] for each df['Name']. The expected output is:
Name gems Count
0 Drama A1, A2, A3, A4, B1, B2, B3, C1 8
1 Ant and Boat KS, KM, A1, A3, A2, JP, CVK 7
Any help appreciated.
Upvotes: 0
Views: 89
Reputation: 893
new_df = pd.concat(
(
df['gems'].str.findall(r'[A-Z]+\d?').explode(),
df['Name']
), axis=1
).drop_duplicates().groupby('Name').agg(pd.Series.tolist)
new_df['Count'] = new_df['gems'].apply(len)
This extracts the gem codes from the gems column of the original DataFrame and explodes them, maps each gem to its associated name, removes the duplicate (gem, Name) pairs, implodes them back into lists, and then counts those lists.
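A minimal runnable sketch of those steps, using the sample data from the question (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'gems': ['A1, A2, A3, A4', 'B1, A1, B2, B3', 'C1, A1',
             'KS, KM, A1, A3', 'A2, A1, JP, CVK'],
    'Name': ['Drama', 'Drama', 'Drama', 'Ant and Boat', 'Ant and Boat'],
})

# Step 1: extract each gem code with a regex and explode to one row per gem;
# concat aligns the exploded gems with the Name of their original row.
pairs = pd.concat(
    (df['gems'].str.findall(r'[A-Z]+\d?').explode(), df['Name']),
    axis=1,
)

# Step 2: drop duplicate (gem, Name) pairs so each gem counts once per Name
unique_pairs = pairs.drop_duplicates()

# Step 3: implode back into one list per Name and count the list lengths
new_df = unique_pairs.groupby('Name').agg(pd.Series.tolist)
new_df['Count'] = new_df['gems'].apply(len)

print(new_df)
```

This yields a Count of 8 for Drama and 7 for Ant and Boat, matching the expected output (Name ends up as the index; add .reset_index() if you want it back as a column).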
Here's an alternative for the first statement:
new_df = (
df.assign(gems=df['gems'].str.findall(r'[A-Z]+\d?'))
.drop('values', axis=1)
.explode('gems')
.drop_duplicates()
.groupby('Name')
.agg(pd.Series.tolist)
)
Upvotes: 0
Reputation: 16162
You could get the set of the split/exploded values in a groupby, then take the len of that column.
df = df.groupby('Name').agg({'gems': lambda x: set(x.str.split(', ').explode())}).reset_index()
df['count'] = df['gems'].apply(len)
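Applied to the question's sample data, this approach looks like the following sketch (using `out` instead of reassigning df, purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'gems': ['A1, A2, A3, A4', 'B1, A1, B2, B3', 'C1, A1',
             'KS, KM, A1, A3', 'A2, A1, JP, CVK'],
    'Name': ['Drama', 'Drama', 'Drama', 'Ant and Boat', 'Ant and Boat'],
})

# For each Name, split the comma-separated strings, flatten them with
# explode, and collect the unique gems into a set.
out = (
    df.groupby('Name')
      .agg({'gems': lambda x: set(x.str.split(', ').explode())})
      .reset_index()
)

# The count is just the size of each set
out['count'] = out['gems'].apply(len)

print(out)
```

One caveat: a set is unordered, so the gems column won't preserve the first-seen order shown in the expected output; if order matters, collecting via list(dict.fromkeys(...)) on the exploded values keeps it while still removing duplicates.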
Upvotes: 0