Pandas equivalent to data.table

Question

With R's data.table, you can perform an operation on a column by group that returns a vector with arbitrary length.

Suppose, I'd like to gather the top 5 values of one column, grouped by another. data.table takes care of gluing the subgroups together.

require(magrittr)
library(data.table)

n <- 100
DT <- data.table(A = rnorm(n), B = sample(letters[1:4], n, replace = TRUE))
DT[, sort(B, decreasing = TRUE) %>% head(5), by = B]

With Python and pandas, I tried different strategies:

import pandas as pd
import random

n = 100
df = pd.DataFrame({'A' : np.random.randn(n), 
                   'B' : [random.choice(list('abc')) for i in range(n)]})
# first try : 
groups = df.groupby('B')
groups.apply(lambda x : sorted(x['A'], reverse=False)[:5])

but this solution gives back a 3x1 (assuming I have 3 groups) DataFrame with lists inside the rows, which does not appear weird though, as sorted will give back a list object.

# second try:
groups.apply(lambda x : np.sort(x['A'])[-5:])

is no more conclusive.

The solution I found, but looks really verbose to me, imply iterating the groups:

dfs = []
for g in groups:
    dfs.append(g[1].sort_values('A').tail(5))
pd.concat(dfs).sort_values('B')

Any idea of what I could use as a more elegant way to perform this operation ? Thanks, I keep on searching for a Pythonic solution...

Scott Boston · Accepted Answer

IIUC:

n = 100
df = pd.DataFrame({'A' : np.random.randn(n), 
                   'B' : [np.random.choice(list('abc')) for i in range(n)]})
groups = df.groupby('B')
groups.apply(lambda x : x['A'].sort_values().tail(5))

Output:

B    
a  97    1.068696
   26    1.142186
   35    1.222854
   18    1.379540
   43    1.466977
b  46    1.185664
   94    1.206992
   49    1.464562
   81    1.609808
   51    1.814868
c  50    1.129484
   8     1.155544
   95    1.250349
   9     1.337286
   90    1.581751
Name: A, dtype: float64

Pandas equivalent to data.table

Answers (1)

Related Questions