Reputation: 1777
In the following multi-indexed dataframe...:
import pandas as pd
import numpy as np
sz=100
typ = [np.random.choice(['ABC', 'DEF', 'GHI', 'JKL']) for _ in range(sz)]
sub_typ = [np.random.choice(['Up', 'Down']) for _ in range(sz)]
field1 = [np.random.randint(0, 9) for _ in range(sz)]
field2 = [np.random.uniform() for _ in range(sz)]
df = pd.DataFrame({'typ': typ, 'sub_typ': sub_typ, 'field1': field1, 'field2': field2})
max_rows = {'ABC': 5, 'JKL': 3} # Maximum no of rows to be selected by type
... I would like to extract the first few rows into a dataframe, based on this dictionary:
max_rows = {'ABC': 2, 'JKL': 3} # Maximum no of rows to be selected by typ
I am able to see the indexes using df.groupby(['typ', 'sub_typ']).apply(lambda g: g.index)
command, but don't know how to retrieve the needed rows. Expected results would be rows with indexes 42 and 65 of ABC.Down, 27 and 48 of ABC.Up, 2, 10 and 20 of JKL.Up ...
typ sub_typ
ABC Down Int64Index([42, 65, 70, 77], dtype='int64')
Up Int64Index([27, 48, 54, 57, 63, 83, 89, 98], d...
DEF Down Int64Index([3, 4, 5, 8, 12, 13, 16, 23, 28, 36...
Up Int64Index([14, 15, 19, 21, 29, 35, 40, 46, 49...
GHI Down Int64Index([0, 6, 7, 9, 11, 30, 34, 37, 38, 53...
Up Int64Index([1, 17, 41, 43, 52, 56, 59, 62, 71,...
JKL Down Int64Index([2, 10, 20, 22, 24, 26, 45, 47, 51,...
Up Int64Index([18, 25, 31, 32, 33, 61, 79, 84, 85...
dtype: object
Is there a way to achieve this?
Upvotes: 1
Views: 24
Reputation: 862911
First filter only rows matching keys in dictionary by Series.isin
and boolean indexing
and then map dictionary to DataFrame.head
in lambda function:
np.random.seed(20)
sz=100
typ = [np.random.choice(['ABC', 'DEF', 'GHI', 'JKL']) for _ in range(sz)]
sub_typ = [np.random.choice(['Up', 'Down']) for _ in range(sz)]
field1 = [np.random.randint(0, 9) for _ in range(sz)]
field2 = [np.random.uniform() for _ in range(sz)]
df = pd.DataFrame({'typ': typ, 'sub_typ': sub_typ, 'field1': field1, 'field2': field2})
max_rows = {'ABC': 2, 'JKL': 3} # Maximum no of rows to be selected by type
mask = df['typ'].isin(max_rows.keys())
df = df[mask].groupby(['typ', 'sub_typ']).apply(lambda x: x.head(max_rows[x.name[0]]))
print (df)
typ sub_typ field1 field2
typ sub_typ
ABC Down 7 ABC Down 1 0.618937
14 ABC Down 4 0.415884
Up 4 ABC Up 5 0.822250
12 ABC Up 3 0.138418
JKL Down 8 JKL Down 8 0.203591
18 JKL Down 4 0.779920
19 JKL Down 3 0.843493
Up 0 JKL Up 1 0.648625
2 JKL Up 3 0.920890
3 JKL Up 6 0.437324
For remove MultiIndex
use:
df = df.reset_index(drop=True)
print (df)
typ sub_typ field1 field2
0 ABC Down 1 0.618937
1 ABC Down 4 0.415884
2 ABC Up 5 0.822250
3 ABC Up 3 0.138418
4 JKL Down 8 0.203591
5 JKL Down 4 0.779920
6 JKL Down 3 0.843493
7 JKL Up 1 0.648625
8 JKL Up 3 0.920890
9 JKL Up 6 0.437324
Upvotes: 1