Reputation: 6778
Let's say I have the following data:
import pandas as pd
df = pd.DataFrame(data=[[1, 'a'], [1, 'aaa'], [1, 'aa'],
                        [2, 'bb'], [2, 'bbb'],
                        [3, 'cc']],
                  columns=['key', 'text'])
   key text
0    1    a
1    1  aaa
2    1   aa
3    2   bb
4    2  bbb
5    3   cc
What I would like to do is group by the key variable, sort the data within each group by the length of text, and end up with a single Series of index values that I can use to reindex the dataframe. I thought I could just do something like this:
df.groupby('key').text.str.len().sort_values(ascending=False).index
But it said I need to use apply, so I tried this:
df.groupby('key').apply(lambda x: x.text.str.len().sort_values(ascending=False).index, axis=1)
But that told me that lambda got an unexpected keyword: axis.
I'm relatively new to pandas, so I'm not sure how to go about this. My actual goal is simply to deduplicate the data so that, for each key, I keep the row with the longest value of text. The expected output is:
   key text
1    1  aaa
4    2  bbb
5    3   cc
If there's an easier way to do this than what I'm attempting, I'm open to that as well.
Upvotes: 2
Views: 3646
Reputation: 13955
No need for the intermediate step. You can get a series with the string lengths like this:
df['text'].str.len()
Now just group by key and return the value at the index where the string length is largest, using idxmax():
In [33]: df.groupby('key').agg(lambda x: x.loc[x.str.len().idxmax()])
Out[33]:
    text
key
1    aaa
2    bbb
3     cc
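If you want the original rows with their original index, as in the expected output of the question, rather than a frame indexed by key, one possible sketch is to run idxmax on the string lengths grouped by key and pass the resulting labels to df.loc. The names lengths and longest_idx are only illustrative, and on ties idxmax returns the first matching label.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 'a'], [1, 'aaa'], [1, 'aa'], [2, 'bb'], [2, 'bbb'], [3, 'cc']],
    columns=['key', 'text'],
)

# Length of each string, aligned with df's index (illustrative name)
lengths = df['text'].str.len()

# For each key, the index label of the row holding the longest text
longest_idx = lengths.groupby(df['key']).idxmax()

# Select those rows; the original index labels are preserved
result = df.loc[longest_idx]
print(result)
```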
Upvotes: 5
Reputation: 2785
def get_longest_string(row):
    # Keep every string whose length equals the group's maximum (ties included)
    max_len = row.str.len().max()
    return [x for x in row.tolist() if len(x) == max_len]
res = df.groupby('key')['text'].apply(get_longest_string).reset_index()
Output:
   key   text
0    1  [aaa]
1    2  [bbb]
2    3   [cc]
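The text column above holds lists. If plain strings are preferred, one option (a sketch, assuming pandas 0.25 or later, where DataFrame.explode is available) is to explode the list column. The name flat is illustrative, and a key with tied longest strings would produce one row per tied string.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 'a'], [1, 'aaa'], [1, 'aa'], [2, 'bb'], [2, 'bbb'], [3, 'cc']],
    columns=['key', 'text'],
)

def get_longest_string(row):
    # Keep every string whose length equals the group's maximum
    max_len = row.str.len().max()
    return [x for x in row.tolist() if len(x) == max_len]

res = df.groupby('key')['text'].apply(get_longest_string).reset_index()

# Turn each list element into its own row, then renumber the index
flat = res.explode('text').reset_index(drop=True)
print(flat)
```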
Upvotes: 1
Reputation: 153460
df.groupby('key', as_index=False).apply(lambda x: x[x.text.str.len() == x.text.str.len().max()])
Output:
     key text
0 1    1  aaa
1 4    2  bbb
2 5    3   cc
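Since the stated goal is deduplication, another route that avoids groupby altogether is to sort by string length and drop duplicate keys. This is only a sketch: the helper column text_len is a made-up name, and unlike the filter above it keeps exactly one row per key even when several strings tie for longest.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 'a'], [1, 'aaa'], [1, 'aa'], [2, 'bb'], [2, 'bbb'], [3, 'cc']],
    columns=['key', 'text'],
)

result = (
    df.assign(text_len=df['text'].str.len())     # temporary helper column
      .sort_values('text_len', ascending=False)  # longest strings first
      .drop_duplicates('key')                    # keep the first row seen per key
      .drop(columns='text_len')                  # remove the helper column
      .sort_index()                              # restore the original row order
)
print(result)
```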
Upvotes: 3