Reputation: 179
My dataframe df looks something like this:
id value
10 a
10 d
10 g
10 g
10 g
23 g
23 h
11 h
11 h
11 h
44 h
44 h
I want to split this dataframe into n different dataframes such that each one contains approximately the same number of unique ids.
I was trying something like this:
ids = df.id.unique()
ids_in_split = np.array_split(ids, n)
This creates the groups of ids that should end up in each split of df. How do I split the original df using ids_in_split?
Any other, more efficient way to do this is also welcome.
Edit, to show the expected outcome:
Say I want to split the df into n = 2 parts. They should look like this:
df1 =
id value
10 a
10 d
10 g
10 g
10 g
23 g
23 h
df2 =
id value
11 h
11 h
11 h
44 h
44 h
In the above output, each split contains all the records for its ids, and both splits cover an equal number of unique ids.
Upvotes: 0
Views: 408
Reputation: 32095
It's unclear what kind of output you're looking for, so here is one possible interpretation and its result:
df
Out[11]:
id value
0 10 a
1 10 d
2 10 g
3 10 g
4 10 g
...
df.reset_index()
Out[12]:
index id value
0 0 10 a
1 1 10 d
2 2 10 g
3 3 10 g
4 4 10 g
...
df['split'] = df.reset_index().groupby('id')['index'].rank()
df.sort_values('split')
Out[17]:
id value split
0 10 a 1.0
5 23 g 1.0
7 11 h 1.0
10 44 h 1.0
1 10 d 2.0
6 23 h 2.0
8 11 h 2.0
11 44 h 2.0
2 10 g 3.0
9 11 h 3.0
3 10 g 4.0
4 10 g 5.0
Now you can group by the split column to get your dataframes.
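That last step could be sketched roughly as follows. This is only an illustration of this answer's interpretation: the sample data is rebuilt from the question, and the name frames is an assumption, not something from the original post. Note that each resulting dataframe holds the k-th row of every id, which differs from the id-based split shown in the question's expected outcome.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 23, 23, 11, 11, 11, 44, 44],
    'value': ['a', 'd', 'g', 'g', 'g', 'g', 'h', 'h', 'h', 'h', 'h', 'h'],
})

# Rank each row within its id group, as in the rank() call above
df['split'] = df.reset_index().groupby('id')['index'].rank()

# 'frames' is a hypothetical name: one dataframe per distinct 'split' value
frames = {key: part for key, part in df.groupby('split')}

for key, part in frames.items():
    print(key)
    print(part)
```

Each key of frames is a rank (1.0, 2.0, ...) and each value is the sub-dataframe of rows with that rank.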
Upvotes: 0
Reputation: 2927
>>> import pandas as pd
>>> df = pd.DataFrame({'id': [10, 10, 10, 10, 10, 23, 23, 11, 11, 11, 44, 44],
...                    'value': ['a', 'd', 'g', 'g', 'g', 'g', 'h', 'h', 'h', 'h', 'h', 'h']})
We group by 'id' and then unpack the groupby object into one tuple per group (this works here because there are exactly four unique ids). The first item of each tuple is the id, and the second item is a data frame.
>>> df1, df2, df3, df4 = df.groupby('id')
>>> df1[1]
id value
0 10 a
1 10 d
2 10 g
3 10 g
4 10 g
>>> type(df1[1])
<class 'pandas.core.frame.DataFrame'>
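To get n dataframes with roughly equal numbers of unique ids, as the question asks, one option is to reuse the question's own ids_in_split together with isin. The following is a minimal sketch under that assumption, not a definitive solution; the name splits is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 23, 23, 11, 11, 11, 44, 44],
    'value': ['a', 'd', 'g', 'g', 'g', 'g', 'h', 'h', 'h', 'h', 'h', 'h'],
})

n = 2

# Unique ids in order of first appearance, cut into n near-equal chunks
ids = df['id'].unique()
ids_in_split = np.array_split(ids, n)

# 'splits' is a hypothetical name: keep, for each chunk, the rows whose id is in it
splits = [df[df['id'].isin(chunk)] for chunk in ids_in_split]

df1, df2 = splits
print(df1)
print(df2)
```

With the sample data and n = 2, df1 holds the rows for ids 10 and 23 and df2 the rows for ids 11 and 44. When the number of unique ids is not divisible by n, np.array_split makes the earlier chunks one id larger.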
Upvotes: 2