Apoorv

Reputation: 179

Split a pandas dataframe into parts with equal numbers of unique values of a column

My dataframe df looks something like this:

id  value 
10  a
10  d
10  g
10  g
10  g
23  g
23  h
11  h
11  h
11  h
44  h
44  h

I want to split this dataframe into n dataframes such that each one contains approximately the same number of unique ids.

I was trying something like this:

ids = df.id.unique()
ids_in_split = np.array_split(ids, n)

This creates the groups of ids that should end up in each split of df. How do I split the original df using ids_in_split? Any other, more efficient way to do this is also welcome.

Edit, adding the expected outcome:

Say I want to split df into n = 2 parts; they should look like this:

df1 =
id  value 
10  a
10  d
10  g
10  g
10  g
23  g
23  h

df2 = 
id  value 
11  h
11  h
11  h
44  h
44  h

In the output above, each split contains all the records for an equal number of unique ids.

Upvotes: 0

Views: 408

Answers (2)

Zeugma

Reputation: 32095

It is unclear what type of output you are looking for; here is one possible interpretation and result:

df
Out[11]: 
    id value
0   10     a
1   10     d
2   10     g
3   10     g
4   10     g
...

df.reset_index()
Out[12]: 
    index  id value
0       0  10     a
1       1  10     d
2       2  10     g
3       3  10     g
4       4  10     g
...

df['split'] = df.reset_index().groupby('id')['index'].rank()


df.sort_values('split')
Out[17]: 
    id value  split
0   10     a    1.0
5   23     g    1.0
7   11     h    1.0
10  44     h    1.0
1   10     d    2.0
6   23     h    2.0
8   11     h    2.0
11  44     h    2.0
2   10     g    3.0
9   11     h    3.0
3   10     g    4.0
4   10     g    5.0

Now you can group by the split column to get your dataframes.
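As a minimal sketch of that last step, the pieces can be collected into a dict keyed by the split value. The name frames is only illustrative, and the sample data is assumed to match the question's df. Note that under this interpretation each piece holds at most one row per id, rather than all rows for a subset of ids.

```python
import pandas as pd

# Sample data assumed to match the dataframe in the question.
df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 23, 23, 11, 11, 11, 44, 44],
    'value': ['a', 'd', 'g', 'g', 'g', 'g', 'h', 'h', 'h', 'h', 'h', 'h'],
})

# Rank each row within its id group: 1.0 for the first row of an id, 2.0 for the second, ...
df['split'] = df.reset_index().groupby('id')['index'].rank()

# Iterating over a groupby yields (key, dataframe) pairs; keep them in a dict.
# "frames" is a hypothetical name chosen for this example.
frames = {key: part for key, part in df.groupby('split')}

# frames[1.0] holds the first row of every id, frames[2.0] the second, and so on.
print(frames[1.0])
```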

Upvotes: 0

spies006

Reputation: 2927

>>> import pandas as pd
>>> df = pd.DataFrame({'id': [10, 10, 10, 10, 10, 23, 23, 11, 11, 11, 44, 44],
...     'value': ['a', 'd', 'g', 'g', 'g', 'g', 'h', 'h', 'h', 'h', 'h', 'h']})

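We group by 'id' and then unpack the grouped data frame, which yields one (id, data frame) tuple per group. The first item of each tuple is the id and the second item is a data frame with all rows for that id. Groups come out sorted by id, so df1 below corresponds to id 10.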

>>> df1, df2, df3, df4 = df.groupby('id')

>>> df1[1]
   id value
0  10     a
1  10     d
2  10     g
3  10     g
4  10     g

>>> type(df1[1])
<class 'pandas.core.frame.DataFrame'>
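If the goal is n data frames rather than one per id, the ids_in_split array from the question can be combined with Series.isin. The following is a sketch under that assumption; the name splits is hypothetical, and n = 2 is taken from the question's expected outcome.

```python
import numpy as np
import pandas as pd

# Sample data assumed to match the dataframe in the question.
df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 23, 23, 11, 11, 11, 44, 44],
    'value': ['a', 'd', 'g', 'g', 'g', 'g', 'h', 'h', 'h', 'h', 'h', 'h'],
})

n = 2  # number of splits, as in the question's expected outcome

# unique() keeps the order of first appearance: 10, 23, 11, 44.
ids = df['id'].unique()

# array_split tolerates a number of ids that is not divisible by n.
ids_in_split = np.array_split(ids, n)

# Keep, for each chunk of ids, every row whose id is in that chunk.
# "splits" is a hypothetical name chosen for this example.
splits = [df[df['id'].isin(chunk)] for chunk in ids_in_split]

df1, df2 = splits
print(df1)
print(df2)
```

Unlike the tuple unpacking above, this does not require knowing the number of distinct ids in advance, only n.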

Upvotes: 2
