Reputation: 1345
Consider the dataframe:
data = [['G1','P1',0.3], ['G1','P2',0.3], ['G1','P3',0.3],['G2','P1',0.3],['G2','P2',0.3],['G3','P2',0.3]]
df2 = pd.DataFrame(data, columns = ['GT', 'PRED','ACC'])
df2 looks like this :
GT PRED ACC
0 G1 P1 0.3
1 G1 P2 0.3
2 G1 P3 0.3
3 G2 P1 0.3
4 G2 P2 0.3
5 G3 P2 0.3
The objective is to pick random rows such that values in GT and PRED are picked uniquely. From other stack overflow searches, I have the following code:
size = 1 # sample size
replace = False # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
pt = df2.groupby('PRED', as_index=False).apply(fn)
whose output is
GT PRED ACC
0 G1 P1 0.3
1 G1 P2 0.3
2 G1 P3 0.3
My expected output should be
GT PRED ACC
0 G1 P3 0.3
1 G2 P1 0.3
2 G3 P2 0.3
That is at the time of filtering we are ensuring that GT and PRED are unique. I can ensure this by applying function after df generation in a for a loop . I wanted to avoid this.
Upvotes: 1
Views: 80
Reputation: 7361
This could be a way.
'GT'
and 'PRED'
columns. They can be obtained using the unique method.numpy
shuffle to shuffle each array (this will give you randomness).pairs
using the shuffled arrays. Each row of this array will be a pair of randomly chosen 'GT'
and 'PRED'
unique values.apply
and some boolean logic, search which rows in df2
have a valid 'GT'
and 'PRED'
pair, i.e. a pair which is present in the pairs
dataframe.df2
to obtain your output.Translated in a working code, this is:
gt_u = df2['GT'].unique()
pred_u = df2['PRED'].unique()
np.random.shuffle(gt_u)
np.random.shuffle(pred_u)
pairs = pd.DataFrame({'GT':gt_u, 'PRED':pred_u})
#pairs = pd.DataFrame([[x, y] for x, y in zip(gt_u, pred_u)], columns=['GT', 'PRED'])
sel = df2.apply(lambda x : (x[['GT', 'PRED']] == pairs).all(axis=1).any(axis=0), axis=1)
out = df2.loc[sel]
Note: if you are not sure that the unique arrays will be of equal length, use the commented line to generate pairs
.
Upvotes: 1