Apricot
Apricot

Reputation: 3021

Pandas subset randomly selected number of rows from dataframe based on values in another data

I have two pandas data frames.

df1:

d = {'col1': ["A", "A","A","B","B","C"], 'col2': [3, 4,5,6,7,8]}
e = pd.DataFrame(data=d)

df2:

f = {'col1': ["A","B","C"], 'col2': [2,1,1]}
g = pd.DataFrame(data=f)

I want to randomly select rows from df1 based on the values of col2 in df2 for each corresponding values of col1. For example, in df2, the count for A is 2, the count for B is 1, so on and so forth. I want to the use this count value from df2 and subset df1 randomly. To make it more explicit, the desired output for subsetted df1 is :

  col1  col2
0  A    3   
1  A    4   
2  B    7   
3  C    8 

The above dataframe has two rows of A, 1 row of B and 1 row of C, while retaining all the column values.

Upvotes: 1

Views: 67

Answers (2)

Vaishali
Vaishali

Reputation: 38415

You can use sample with parameter n

count = df2.set_index('col1')['col2'].to_dict()
df1.groupby('col1').apply(lambda x: x.sample(n=count[x.name])).reset_index(drop = True)


   col1 col2
0   A   4
1   A   3
2   B   6
3   C   8

Upvotes: 3

BENY
BENY

Reputation: 323366

We can using reindex + numpy shuffle, then using concat combine the result back

np.random.shuffle(e.index.values)
idx=e.index.values
np.random.shuffle(idx)
e=e.reindex(idx)
pd.concat([e[e.col1==x ].iloc[:y,:]for x,y in zip(g.col1,g.col2)])
Out[402]: 
  col1  col2
5    A     3
1    A     4
3    B     6
2    C     8

Upvotes: 2

Related Questions