Leothorn
Leothorn

Reputation: 1345

Pick random values from a dataframe such that resultant dataframe is unique within two columns in python-pandas

Consider the dataframe:

data = [['G1','P1',0.3], ['G1','P2',0.3], ['G1','P3',0.3],['G2','P1',0.3],['G2','P2',0.3],['G3','P2',0.3]] 
df2 = pd.DataFrame(data, columns = ['GT', 'PRED','ACC']) 

df2 looks like this :

    GT  PRED    ACC
0   G1  P1      0.3
1   G1  P2      0.3
2   G1  P3      0.3
3   G2  P1      0.3
4   G2  P2      0.3
5   G3  P2      0.3

The objective is to pick random rows such that values in GT and PRED are picked uniquely. From other stack overflow searches, I have the following code:

size = 1        # sample size
replace = False # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
pt = df2.groupby('PRED', as_index=False).apply(fn)

whose output is

    GT  PRED    ACC
0   G1  P1      0.3
1   G1  P2      0.3
2   G1  P3      0.3

My expected output should be

    GT  PRED    ACC
0   G1  P3      0.3
1   G2  P1      0.3
2   G3  P2      0.3

That is at the time of filtering we are ensuring that GT and PRED are unique. I can ensure this by applying function after df generation in a for a loop . I wanted to avoid this.

Upvotes: 1

Views: 80

Answers (1)

Valentino
Valentino

Reputation: 7361

This could be a way.

  1. Make arrays with unique values of 'GT' and 'PRED' columns. They can be obtained using the unique method.
  2. Use numpy shuffle to shuffle each array (this will give you randomness).
  3. Make a dataframe pairs using the shuffled arrays. Each row of this array will be a pair of randomly chosen 'GT' and 'PRED' unique values.
  4. With pandas apply and some boolean logic, search which rows in df2 have a valid 'GT' and 'PRED' pair, i.e. a pair which is present in the pairs dataframe.
  5. Select those rows only from df2 to obtain your output.

Translated in a working code, this is:

gt_u = df2['GT'].unique()
pred_u = df2['PRED'].unique()

np.random.shuffle(gt_u)
np.random.shuffle(pred_u)

pairs = pd.DataFrame({'GT':gt_u, 'PRED':pred_u})    
#pairs = pd.DataFrame([[x, y] for x, y in zip(gt_u, pred_u)], columns=['GT', 'PRED'])

sel = df2.apply(lambda x : (x[['GT', 'PRED']] == pairs).all(axis=1).any(axis=0), axis=1)
out = df2.loc[sel]

Note: if you are not sure that the unique arrays will be of equal length, use the commented line to generate pairs.

Upvotes: 1

Related Questions