odebear

Reputation: 53

Getting the index of the max value in a list and returning the values at that index from the other list columns

I have a df

review                          review_id   word_count
['abc def', 'ghi']              [1, 2]      [2, 1]
['jkl mno pqr', 'stu', 'vwx']   [3, 4, 5]   [3, 1, 1]

For each row in the df, I would like to create new review and review_id columns that contain only the review/review_id with the maximum word count. Thus, the output should look something like this.

review         review_id
abc def        1      
jkl mno pqr    3

I can only think of getting the index of the max value in word_count. Any suggestions will be welcome. Thanks!
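For reference, the example df can be rebuilt like this (a minimal sketch; in practice the data presumably comes from an earlier processing step):

import pandas as pd

# minimal reconstruction of the example frame; each cell holds a Python list
df = pd.DataFrame({
    'review': [['abc def', 'ghi'], ['jkl mno pqr', 'stu', 'vwx']],
    'review_id': [[1, 2], [3, 4, 5]],
    'word_count': [[2, 1], [3, 1, 1]],
})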

Upvotes: 2

Views: 82

Answers (2)

jezrael

Reputation: 862831

Use DataFrame.explode, then compare each word_count to the maximal value per group (the groups are the original rows) and select the matching rows by boolean indexing if you need all maximal values:

df = df.explode(['review','review_id','word_count'])
# for older pandas versions (multi-column explode needs pandas >= 1.3)
#df = df.apply(lambda x: x.explode())

df2 = df[df['word_count'].eq(df.groupby(level=0)['word_count'].transform('max'))]
print (df2)
        review review_id word_count
0      abc def         1          2
1  jkl mno pqr         3          3

EDIT: First create a unique MultiIndex:

df = (df.apply(lambda x: x.explode())
        .assign(new=lambda x: range(len(x)))
        .set_index('new', append=True))
print (df)
            review review_id word_count
  new                                  
0 0        abc def         1          2
  1            ghi         2          1
1 2    jkl mno pqr         3          3
  3            stu         4          1
  4            vwx         5          1
  

Then get the index of the maximal word_count per first level of the MultiIndex (each group represents an original row) with DataFrameGroupBy.idxmax and select those rows with DataFrame.loc:

# the exploded word_count is object dtype; cast to int so idxmax compares numbers
df['word_count'] = df['word_count'].astype(int)

df2 = df.loc[df.groupby(level=0)['word_count'].idxmax()].reset_index(drop=True)
print (df2)
        review review_id  word_count
0      abc def         1           2
1  jkl mno pqr         3           3
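Note the difference between the two variants: the boolean-indexing version keeps every row that ties for the maximum word_count, while the idxmax version returns exactly one row per original row.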

Upvotes: 4

mozway

Reputation: 260965

For efficiency, you can use a list comprehension and the DataFrame constructor. This will be at least one order of magnitude faster than using explode:

import numpy as np
df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m]) for a,b,x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=['review', 'review_id']
                   )
print(df2)

output:

        review  review_id
0      abc def          1
1  jkl mno pqr          3

If you want all columns:

df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m], x[m]) for a,b,x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=df.columns
                   )

output:

        review  review_id  word_count
0      abc def          1           2
1  jkl mno pqr          3           3
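The walrus operator (:=) requires Python 3.8+. On older interpreters the same list-comprehension idea works with a small helper; a sketch (the pick_max name is just for illustration):

import numpy as np
import pandas as pd

def pick_max(a, b, x):
    # index of the largest word count, then the matching elements
    m = np.argmax(x)
    return a[m], b[m], x[m]

df2 = pd.DataFrame([pick_max(a, b, x) for a, b, x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=df.columns)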

Timing

on 20k rows

# list comprehension
50.2 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# explode + groupby
1.22 s ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
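The exact benchmark setup isn't shown; one plausible way to reproduce it in IPython (reusing df, pd and np from above; the big name is an assumption) is to tile the two example rows to 20k and time both approaches:

# build a 20k-row frame by repeating the two example rows
big = pd.concat([df] * 10_000, ignore_index=True)

# list comprehension
%timeit pd.DataFrame([(a[(m:=np.argmax(x))], b[m], x[m]) for a,b,x in
                      zip(big['review'], big['review_id'], big['word_count'])],
                     index=big.index, columns=big.columns)

# explode + groupby (boolean-indexing variant)
%timeit big.explode(['review', 'review_id', 'word_count']).pipe(
    lambda d: d[d['word_count'].eq(d.groupby(level=0)['word_count'].transform('max'))])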

Upvotes: 1
