Reputation: 33
With R
's data.table
, you can perform an operation on a column by group that returns a vector with arbitrary length.
Suppose, I'd like to gather the top 5 values of one column, grouped by another. data.table
takes care of gluing the subgroups together.
require(magrittr)
library(data.table)
n <- 100
DT <- data.table(A = rnorm(n), B = sample(letters[1:4], n, replace = TRUE))
DT[, sort(B, decreasing = TRUE) %>% head(5), by = B]
With Python
and pandas
, I tried different strategies:
import pandas as pd
import random
n = 100
df = pd.DataFrame({'A' : np.random.randn(n),
'B' : [random.choice(list('abc')) for i in range(n)]})
# first try :
groups = df.groupby('B')
groups.apply(lambda x : sorted(x['A'], reverse=False)[:5])
but this solution gives back a 3x1 (assuming I have 3 groups) DataFrame with lists inside the rows, which does not appear weird though, as sorted will give back a list object.
# second try:
groups.apply(lambda x : np.sort(x['A'])[-5:])
is no more conclusive.
The solution I found, but looks really verbose to me, imply iterating the groups:
dfs = []
for g in groups:
dfs.append(g[1].sort_values('A').tail(5))
pd.concat(dfs).sort_values('B')
Any idea of what I could use as a more elegant way to perform this operation ? Thanks, I keep on searching for a Pythonic solution...
Upvotes: 3
Views: 1381
Reputation: 153460
IIUC:
n = 100
df = pd.DataFrame({'A' : np.random.randn(n),
'B' : [np.random.choice(list('abc')) for i in range(n)]})
groups = df.groupby('B')
groups.apply(lambda x : x['A'].sort_values().tail(5))
Output:
B
a 97 1.068696
26 1.142186
35 1.222854
18 1.379540
43 1.466977
b 46 1.185664
94 1.206992
49 1.464562
81 1.609808
51 1.814868
c 50 1.129484
8 1.155544
95 1.250349
9 1.337286
90 1.581751
Name: A, dtype: float64
Upvotes: 2