Reputation: 103
I have a GroupBy object dfyg
containing 120,000 groups. What's the best way to select 10,000 of these groups and pass them to the multiprocessing.Pool.map()
function?
I can think of a for loop which selects 10,000 groups and puts them in a list.
I cannot filter the dataframe before grouping because I want to pass either all the rows of a group to the map
function or none of them.
i = 0
iter_list = []
for name, group in dfyg:
    iter_list.append(group)
    i = i + 1
    if i >= 10000:
        break
Upvotes: 0
Views: 1268
Reputation: 21274
You can create a subset of groups using the groups.keys()
property, then use groupby.filter():
subset = list(gb.groups.keys())[:n_grp]
gb.filter(lambda x: x.name in subset)
Data:
import numpy as np
import pandas as pd
n = 1000
n_grp = 2
grp = ["A", "B", "C", "D"]
data = {"grp": np.random.choice(grp, size=n, replace=True),
"val": np.random.random(size=n)}
df = pd.DataFrame(data)
gb = df.groupby("grp")
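Putting the two snippets together: note that filter() returns a flat dataframe, so to get per-group frames for Pool.map you would group it again. A runnable sketch (the seed and the re-grouping step are my additions, not part of the answer above):

```python
import numpy as np
import pandas as pd

n = 1000
n_grp = 2
np.random.seed(0)  # illustrative seed for reproducibility
df = pd.DataFrame({"grp": np.random.choice(list("ABCD"), size=n),
                   "val": np.random.random(size=n)})
gb = df.groupby("grp")

# take the first n_grp group keys and keep only their rows
subset = list(gb.groups.keys())[:n_grp]
filtered = gb.filter(lambda x: x.name in subset)

# filtered is a flat frame again; re-group to get one frame per group
pieces = [g for _, g in filtered.groupby("grp")]
print(len(pieces))  # 2
```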
Upvotes: 2
Reputation: 323346
You can still filter before groupby
using factorize
, which assigns each group-key value an integer. Then slice the rows whose code is less than 10000, or pick random groups with np.random.choice
(e.g. groupneed = np.random.choice(np.unique(pd.factorize(df.groupbykey)[0]), 2, replace=False)
)
df = pd.DataFrame({'groupbykey': list('aabbddcc')})
df[pd.factorize(df.groupbykey)[0] < 2]
  groupbykey
0          a
1          a
2          b
3          b
# df[np.isin(pd.factorize(df.groupbykey)[0], groupneed)]
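A self-contained sketch of both variants from this answer, the `< 2` slice and the random np.isin pick (the seed and the printed row counts are my additions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"groupbykey": list("aabbddcc")})

codes = pd.factorize(df.groupbykey)[0]  # one int code per group key

# deterministic: first 2 groups in order of appearance
first_two = df[codes < 2]

# random: pick 2 of the 4 group codes without replacement
np.random.seed(0)  # illustrative seed
groupneed = np.random.choice(np.unique(codes), 2, replace=False)
random_two = df[np.isin(codes, groupneed)]

# each key has 2 rows, so 2 groups -> 4 rows either way
print(len(first_two), len(random_two))  # 4 4
```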
Upvotes: 2