reservoirinvest
reservoirinvest

Reputation: 1777

Getting certain number of rows in a multi-indexed dataframe

In the following multi-indexed dataframe...:

import pandas as pd
import numpy as np

sz=100
typ = [np.random.choice(['ABC', 'DEF', 'GHI', 'JKL']) for _ in range(sz)]
sub_typ = [np.random.choice(['Up', 'Down']) for _ in range(sz)]
field1 = [np.random.randint(0, 9) for _ in range(sz)]
field2 = [np.random.uniform() for _ in range(sz)]

df = pd.DataFrame({'typ': typ, 'sub_typ': sub_typ, 'field1': field1, 'field2': field2})

max_rows = {'ABC': 5, 'JKL': 3} # Maximum no of rows to be selected by type

... I would like to extract the first few rows into a dataframe, based on this dictionary:

max_rows = {'ABC': 2, 'JKL': 3} # Maximum no of rows to be selected by typ

I am able to see the indexes using df.groupby(['typ', 'sub_typ']).apply(lambda g: g.index) command, but don't know how to retrieve the needed rows. Expected results would be rows with indexes 42 and 65 of ABC.Down, 27 and 48 of ABC.Up, 2, 10 and 20 of JKL.Up ...

typ  sub_typ
ABC  Down             Int64Index([42, 65, 70, 77], dtype='int64')
     Up         Int64Index([27, 48, 54, 57, 63, 83, 89, 98], d...
DEF  Down       Int64Index([3, 4, 5, 8, 12, 13, 16, 23, 28, 36...
     Up         Int64Index([14, 15, 19, 21, 29, 35, 40, 46, 49...
GHI  Down       Int64Index([0, 6, 7, 9, 11, 30, 34, 37, 38, 53...
     Up         Int64Index([1, 17, 41, 43, 52, 56, 59, 62, 71,...
JKL  Down       Int64Index([2, 10, 20, 22, 24, 26, 45, 47, 51,...
     Up         Int64Index([18, 25, 31, 32, 33, 61, 79, 84, 85...
dtype: object

Is there a way to achieve this?

Upvotes: 1

Views: 24

Answers (1)

jezrael
jezrael

Reputation: 862911

First filter only rows matching keys in dictionary by Series.isin and boolean indexing and then map dictionary to DataFrame.head in lambda function:

np.random.seed(20)
    
sz=100
typ = [np.random.choice(['ABC', 'DEF', 'GHI', 'JKL']) for _ in range(sz)]
sub_typ = [np.random.choice(['Up', 'Down']) for _ in range(sz)]
field1 = [np.random.randint(0, 9) for _ in range(sz)]
field2 = [np.random.uniform() for _ in range(sz)]

df = pd.DataFrame({'typ': typ, 'sub_typ': sub_typ, 'field1': field1, 'field2': field2})

max_rows = {'ABC': 2, 'JKL': 3} # Maximum no of rows to be selected by type


mask = df['typ'].isin(max_rows.keys())
df = df[mask].groupby(['typ', 'sub_typ']).apply(lambda x: x.head(max_rows[x.name[0]]))
print (df)
                typ sub_typ  field1    field2
typ sub_typ                                  
ABC Down    7   ABC    Down       1  0.618937
            14  ABC    Down       4  0.415884
    Up      4   ABC      Up       5  0.822250
            12  ABC      Up       3  0.138418
JKL Down    8   JKL    Down       8  0.203591
            18  JKL    Down       4  0.779920
            19  JKL    Down       3  0.843493
    Up      0   JKL      Up       1  0.648625
            2   JKL      Up       3  0.920890
            3   JKL      Up       6  0.437324

For remove MultiIndex use:

df = df.reset_index(drop=True)
print (df)
   typ sub_typ  field1    field2
0  ABC    Down       1  0.618937
1  ABC    Down       4  0.415884
2  ABC      Up       5  0.822250
3  ABC      Up       3  0.138418
4  JKL    Down       8  0.203591
5  JKL    Down       4  0.779920
6  JKL    Down       3  0.843493
7  JKL      Up       1  0.648625
8  JKL      Up       3  0.920890
9  JKL      Up       6  0.437324

Upvotes: 1

Related Questions