Ahmad
Ahmad

Reputation: 9658

Pandas: Groupby, concatenate one column and identify the row with maximums

I have a datframe like this:

prefix  input_text target_text score
X       V           A           1
X       V           B           2
X       W           C           1
X       W           B           3

I want to group them by some columns and concatenate the column target_text, meanwhile get the maximum of score in each group and identify the target_text with highest score, like this:

prefix  input_text target_text score  top
X       V           A, B        2     B
X       W           C, B        3     B

This is my code which does the concatenation, however I just don't know about the rest.

df['target_text'] = df[['prefix', 'target_text','input_text']].groupby(['input_text','prefix'])['target_text'].transform(lambda x: '<br />'.join(x))
        
df = df.drop_duplicates(subset=['prefix','input_text','target_text'])

In concatenation I use html code to concat them, if I could bold the target with highest score, then it would be nice.

Upvotes: 2

Views: 83

Answers (3)

Anurag Dabas
Anurag Dabas

Reputation: 24314

try via sort_values(), groupby() and agg():

out=(df.sort_values('score')
       .groupby(['prefix', 'input_text'], as_index=False)
       .agg(target_text=('target_text', ', '.join), score=('score', 'max'), top=('target_text', 'last')))

output of out:

    input_text  prefix  score   target_text     top
0       V       X       2           A, B         B
1       W       X       3           C, B         B

Explaination:

we are sorting values of 'score' and then grouping by column 'input_text' and 'prefix' and aggregrating values that are as follows:

  • we are joining together the values of 'target_text' by ', '
  • we are getting only max value of 'score column' bcz we are aggregrating max
  • we are getting last value of 'target_text' column since we sorted previously so now we are aggregrating last on it

Update:

If you have many more columns to include then you can aggregrate them if they are not in high in number otherwise:

newdf=df.sort_values('score',ascending=False).drop_duplicates(['prefix','input_text'],ignore_index=True)
#Finally join them    
out=out.join(newdf[list of column names that you want])
#For example:
#out=out.join(newdf[['target_first','target_last]])

Upvotes: 3

BENY
BENY

Reputation: 323306

Let us try

df.sort_values('score',ascending=False).\
  drop_duplicates(['prefix','input_text']).\
    rename(columns={'target_text':'top'}).\
      merge(df.groupby(['prefix','input_text'],as_index=False)['target_text'].agg(','.join))
Out[259]: 
  prefix input_text top  score target_text
0      X          W   B      3         C,B
1      X          V   B      2         A,B

Upvotes: 4

Henry Ecker
Henry Ecker

Reputation: 35646

groupby agg would be useful here:

new_df = (
    df.groupby(['prefix', 'input_text'], as_index=False).agg(
        target_text=('target_text', ', '.join),
        score=('score', 'max'),
        top=('score', 'idxmax')
    )
)
new_df['top'] = df.loc[new_df['top'], 'target_text'].values

new_df:

  prefix input_text target_text  score top
0      X          V        A, B      2   B
1      X          W        C, B      3   B

Aggregations are as follows:

  1. target_text is joined together using ', '.join.
  2. score is aggregated to only keep the max value with `'max'
  3. top is the idxmax of the score column.
new_df = (
    df.groupby(['prefix', 'input_text'], as_index=False).agg(
        target_text=('target_text', ', '.join),
        score=('score', 'max'),
        top=('score', 'idxmax')
    )
)
  prefix input_text target_text  score  top
0      X          V        A, B      2    1
1      X          W        C, B      3    3

The values in top are the corresponding indexes from df:

  prefix input_text target_text  score
0      X          V           A      1
1      X          V           B      2  # index 1
2      X          W           C      1
3      X          W           B      3  # index 3

These values need to be "looked up" from df:

df.loc[new_df['top'], 'target_text']
1    B
3    B
Name: target_text, dtype: object

And assigned back to new_df. values is needed to break the index alignment.

Upvotes: 4

Related Questions