Reputation: 9658
I have a datframe like this:
prefix input_text target_text score
X V A 1
X V B 2
X W C 1
X W B 3
I want to group them by some columns and concatenate the column target_text
, meanwhile get the maximum of score
in each group and identify the target_text
with highest score, like this:
prefix input_text target_text score top
X V A, B 2 B
X W C, B 3 B
This is my code which does the concatenation, however I just don't know about the rest.
df['target_text'] = df[['prefix', 'target_text','input_text']].groupby(['input_text','prefix'])['target_text'].transform(lambda x: '<br />'.join(x))
df = df.drop_duplicates(subset=['prefix','input_text','target_text'])
In concatenation I use html code to concat them, if I could bold the target with highest score, then it would be nice.
Upvotes: 2
Views: 83
Reputation: 24314
try via sort_values()
, groupby()
and agg()
:
out=(df.sort_values('score')
.groupby(['prefix', 'input_text'], as_index=False)
.agg(target_text=('target_text', ', '.join), score=('score', 'max'), top=('target_text', 'last')))
output of out
:
input_text prefix score target_text top
0 V X 2 A, B B
1 W X 3 C, B B
Explaination:
we are sorting values of 'score' and then grouping by column 'input_text' and 'prefix' and aggregrating values that are as follows:
', '
max
last
on itUpdate:
If you have many more columns to include then you can aggregrate them if they are not in high in number otherwise:
newdf=df.sort_values('score',ascending=False).drop_duplicates(['prefix','input_text'],ignore_index=True)
#Finally join them
out=out.join(newdf[list of column names that you want])
#For example:
#out=out.join(newdf[['target_first','target_last]])
Upvotes: 3
Reputation: 323306
Let us try
df.sort_values('score',ascending=False).\
drop_duplicates(['prefix','input_text']).\
rename(columns={'target_text':'top'}).\
merge(df.groupby(['prefix','input_text'],as_index=False)['target_text'].agg(','.join))
Out[259]:
prefix input_text top score target_text
0 X W B 3 C,B
1 X V B 2 A,B
Upvotes: 4
Reputation: 35646
groupby agg
would be useful here:
new_df = (
df.groupby(['prefix', 'input_text'], as_index=False).agg(
target_text=('target_text', ', '.join),
score=('score', 'max'),
top=('score', 'idxmax')
)
)
new_df['top'] = df.loc[new_df['top'], 'target_text'].values
new_df
:
prefix input_text target_text score top
0 X V A, B 2 B
1 X W C, B 3 B
Aggregations are as follows:
target_text
is joined together using ', '.join
.score
is aggregated to only keep the max value with `'max'top
is the idxmax
of the score
column.new_df = (
df.groupby(['prefix', 'input_text'], as_index=False).agg(
target_text=('target_text', ', '.join),
score=('score', 'max'),
top=('score', 'idxmax')
)
)
prefix input_text target_text score top
0 X V A, B 2 1
1 X W C, B 3 3
The values in top
are the corresponding indexes from df
:
prefix input_text target_text score
0 X V A 1
1 X V B 2 # index 1
2 X W C 1
3 X W B 3 # index 3
These values need to be "looked up" from df
:
df.loc[new_df['top'], 'target_text']
1 B
3 B
Name: target_text, dtype: object
And assigned back to new_df
. values
is needed to break the index alignment.
Upvotes: 4