odebear

Reputation: 53

Getting the index of the max value in a list and returning the values at that index from the other list columns

I have a df

review                          review_id   word_count
['abc def', 'ghi']              [1, 2]      [2, 1]
['jkl mno pqr', 'stu', 'vwx']   [3, 4, 5]   [3, 1, 1]

For each row in the df, I would like to create new review and review_id columns that contain only the review/review_id with the maximum word count. Thus, the output should look something like this.

review         review_id
abc def        1      
jkl mno pqr    3

I can only think of getting the index of the max value in word_count. Any suggestions will be welcome. Thanks!
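For reference, the example df can be rebuilt like this (a minimal sketch; in practice the data presumably comes from an earlier processing step):

import pandas as pd

# minimal reconstruction of the example frame; each cell holds a Python list
df = pd.DataFrame({
    'review': [['abc def', 'ghi'], ['jkl mno pqr', 'stu', 'vwx']],
    'review_id': [[1, 2], [3, 4, 5]],
    'word_count': [[2, 1], [3, 1, 1]],
})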

Upvotes: 2

Views: 82

Answers (2)

jezrael

Reputation: 862831

Use DataFrame.explode, then compare each word_count to the maximal value per group (the groups are the original rows) and select the matching rows by boolean indexing if you need all maximal values:

df = df.explode(['review','review_id','word_count'])
# for older pandas versions (multi-column explode needs pandas >= 1.3)
#df = df.apply(lambda x: x.explode())

df2 = df[df['word_count'].eq(df.groupby(level=0)['word_count'].transform('max'))]
print (df2)
        review review_id word_count
0      abc def         1          2
1  jkl mno pqr         3          3

EDIT: First create a unique MultiIndex:

df = (df.apply(lambda x: x.explode())
        .assign(new=lambda x: range(len(x)))
        .set_index('new', append=True))
print (df)
            review review_id word_count
  new                                  
0 0        abc def         1          2
  1            ghi         2          1
1 2    jkl mno pqr         3          3
  3            stu         4          1
  4            vwx         5          1
  

Then get the index of the maximal word_count per first level of the MultiIndex (each group represents an original row) with DataFrameGroupBy.idxmax and select those rows with DataFrame.loc:

# the exploded word_count is object dtype; cast to int so idxmax compares numbers
df['word_count'] = df['word_count'].astype(int)

df2 = df.loc[df.groupby(level=0)['word_count'].idxmax()].reset_index(drop=True)
print (df2)
        review review_id  word_count
0      abc def         1           2
1  jkl mno pqr         3           3
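Note the difference between the two variants: the boolean-indexing version keeps every row that ties for the maximum word_count, while the idxmax version returns exactly one row per original row.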

Upvotes: 4

mozway

Reputation: 260965

For efficiency, you can use a list comprehension and the DataFrame constructor. This will be at least one order of magnitude faster than using explode:

import numpy as np
df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m]) for a,b,x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=['review', 'review_id']
                   )
print(df2)

output:

        review  review_id
0      abc def          1
1  jkl mno pqr          3

If you want all columns:

df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m], x[m]) for a,b,x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=df.columns
                   )

output:

        review  review_id  word_count
0      abc def          1           2
1  jkl mno pqr          3           3
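The walrus operator (:=) requires Python 3.8+. On older interpreters the same list-comprehension idea works with a small helper; a sketch (the pick_max name is just for illustration):

import numpy as np
import pandas as pd

def pick_max(a, b, x):
    # index of the largest word count, then the matching elements
    m = np.argmax(x)
    return a[m], b[m], x[m]

df2 = pd.DataFrame([pick_max(a, b, x) for a, b, x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=df.columns)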

Timing

on 20k rows

# list comprehension
50.2 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# explode + groupby
1.22 s ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
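The exact benchmark setup isn't shown; one plausible way to reproduce it in IPython (reusing df, pd and np from above; the big name is an assumption) is to tile the two example rows to 20k and time both approaches:

# build a 20k-row frame by repeating the two example rows
big = pd.concat([df] * 10_000, ignore_index=True)

# list comprehension
%timeit pd.DataFrame([(a[(m:=np.argmax(x))], b[m], x[m]) for a,b,x in
                      zip(big['review'], big['review_id'], big['word_count'])],
                     index=big.index, columns=big.columns)

# explode + groupby (boolean-indexing variant)
%timeit big.explode(['review', 'review_id', 'word_count']).pipe(
    lambda d: d[d['word_count'].eq(d.groupby(level=0)['word_count'].transform('max'))])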

Upvotes: 1
