Tracy Boodhoo

Reputation: 33

How can I choose a random sample of size n from values from a single pandas dataframe column, with repeating values occurring a maximum of 2 times?

My dataframe looks like this:

Identifier       Strain     Other columns, etc.
1                  A
2                  C
3                  D
4                  B
5                  A
6                  C
7                  C
8                  B
9                  D
10                 A
11                 D
12                 D

I want to choose n rows at random while maintaining diversity in the strain values. For example, I want a group of 6, so I'd expect my final rows to include at least one of every type of strain with two strains appearing twice.

I've tried converting the Strain column into a numpy array and using the method random.choice but that didn't seem to run. I've also tried using .sample but it does not maximize strain diversity.

This is my latest attempt which outputs a sample of size 7 in order (identifiers 0-7) and the Strains are all the same.

randomsample = df[df.Strain == np.random.choice(df['Strain'].unique())].reset_index(drop=True)

Upvotes: 2

Views: 508

Answers (1)

Quang Hoang

Reputation: 150785

I believe there's something in numpy that does exactly this, but I can't recall what it is. Here's a fairly fast approach:

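  1. Shuffle the rows for randomness
  2. Enumerate the rows within each Strain group (0 for the first row of a strain, 1 for the second, and so on)
  3. Sort by that enumeration, so every strain's first row comes before any strain's second row
  4. Slice the top n rows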

So in code:

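import numpy as np

n = 6

df = df.sample(frac=1)                      # step 1: shuffle all rows
enums = df.groupby('Strain').cumcount()     # step 2: 0, 1, 2, ... within each strain

# step 3: row positions ordered by the enumeration (stable sort keeps the shuffled order on ties)
orders = np.argsort(enums.to_numpy(), kind='stable')
samples = df.iloc[orders[:n]]               # step 4: first n rows in that order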

Output (one possible result, since the shuffle is random):

   Identifier Strain  Other columns, etc.
2           3      D                  NaN
7           8      B                  NaN
0           1      A                  NaN
5           6      C                  NaN
4           5      A                  NaN
8           9      D                  NaN
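
The four steps above can be wrapped into a reusable function. This is only a minimal sketch: the function name diverse_sample and the parameters max_per_group and seed are hypothetical names introduced here, not part of pandas or of the original answer. It adds an explicit cap so that no strain can appear more than twice, even when n is large relative to the number of strains, and it rebuilds the question's table so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def diverse_sample(df, column, n, max_per_group=2, seed=None):
    """Pick up to n rows, cycling through the groups of `column` so that
    no value appears more than `max_per_group` times.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Shuffle all rows (seed is optional, for reproducibility)
    shuffled = df.sample(frac=1, random_state=seed)
    # 0 for the first shuffled row of each group, 1 for the second, ...
    rank = shuffled.groupby(column).cumcount()
    # Drop rows beyond the per-group cap
    keep = rank < max_per_group
    capped = shuffled[keep]
    # Stable sort: all rank-0 rows first, then rank-1 rows, ...
    order = np.argsort(rank[keep].to_numpy(), kind="stable")
    return capped.iloc[order[:n]]


# The table from the question (other columns omitted)
df = pd.DataFrame({
    "Identifier": range(1, 13),
    "Strain": list("ACDBACCBDADD"),
})

picked = diverse_sample(df, "Strain", n=6, seed=0)
counts = picked["Strain"].value_counts()
```

With this data and n = 6, every strain shows up at least once and two strains show up twice; which rows are picked depends on the seed. If n exceeds max_per_group times the number of strains, the function returns fewer than n rows rather than breaking the cap.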

Upvotes: 2
