apkul

Reputation: 103

How to efficiently index a Groupby object?

I have a GroupBy object dfyg containing 120,000 groups. What's the best way to select 10,000 of these groups and pass them to multiprocessing.Pool.map()?

I can think of a for loop which selects 10,000 groups and puts them in a list. I cannot filter the dataframe before grouping, because I need to pass either all the rows in a group to the map function or none of them.

iter_list = []
for i, (name, group) in enumerate(dfyg):
    if i >= 10000:
        break
    iter_list.append(group)
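The loop above can be written more compactly with itertools.islice, which stops iteration after a fixed number of items. A sketch with a small stand-in GroupBy object (dfyg here is toy data, not the question's 120,000-group object):

```python
from itertools import islice

import pandas as pd

# Small stand-in for the question's GroupBy object: 20 groups of 5 rows.
df = pd.DataFrame({"key": list(range(20)) * 5, "val": range(100)})
dfyg = df.groupby("key")

# Take the first 10 groups (10,000 in the question) without a manual counter.
iter_list = [group for name, group in islice(iter(dfyg), 10)]
```

The resulting list of per-group DataFrames can then be handed to Pool.map() as in the question.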

Upvotes: 0

Views: 1268

Answers (2)

andrew_reece

Reputation: 21274

You can create a subset of the group keys with gb.groups.keys(), then use GroupBy.filter():

subset = list(gb.groups.keys())[:n_grp]
gb.filter(lambda x: x.name in subset)

Data:

import numpy as np
import pandas as pd

n = 1000
n_grp = 2
grp = ["A", "B", "C", "D"]
data = {"grp": np.random.choice(grp, size=n, replace=True),
        "val": np.random.random(size=n)}
df = pd.DataFrame(data)
gb = df.groupby("grp")
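Putting the pieces together end to end (same data as above, with a fixed seed added for reproducibility, and the subset converted to a set so the per-group membership test is O(1)):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # added so the example is reproducible
n = 1000
n_grp = 2
grp = ["A", "B", "C", "D"]
df = pd.DataFrame({"grp": np.random.choice(grp, size=n, replace=True),
                   "val": np.random.random(size=n)})
gb = df.groupby("grp")

# Keep only the first n_grp group keys (keys come back sorted).
subset = set(list(gb.groups.keys())[:n_grp])

# filter() passes each group DataFrame, whose .name is the group key;
# groups whose key is not in subset are dropped wholesale.
result = gb.filter(lambda x: x.name in subset)
```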

Upvotes: 2

BENY

Reputation: 323346

You can still filter before the groupby by using pd.factorize, which assigns each group key an integer code. Then either slice on codes less than 10000, or pick groups at random with np.random.choice, e.g. groupneeed = np.random.choice(np.unique(pd.factorize(df.groupbykey)[0]), 2, replace=False).

df = pd.DataFrame({'groupbykey': list('aabbddcc')})
df[pd.factorize(df.groupbykey)[0] < 2]

  groupbykey
0          a
1          a
2          b
3          b

# random variant:
# df[np.isin(pd.factorize(df.groupbykey)[0], groupneeed)]
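The random-selection variant as a runnable sketch (picking 2 of the 4 keys; the variable name groupneeed follows the answer above):

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # added for reproducibility
df = pd.DataFrame({'groupbykey': list('aabbddcc')})

# factorize assigns each distinct key an integer code in order of appearance:
# a -> 0, b -> 1, d -> 2, c -> 3.
codes = pd.factorize(df.groupbykey)[0]

# Randomly pick 2 of the group codes, then keep every row of those groups.
groupneeed = np.random.choice(np.unique(codes), 2, replace=False)
picked = df[np.isin(codes, groupneeed)]
```

Because whole groups are kept or dropped by code, no group is ever split, which satisfies the all-rows-or-none requirement in the question.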

Upvotes: 2
