GNMO11

Reputation: 2259

Pandas create random samples without duplicates

I have a pandas DataFrame containing ~200,000 rows, and I would like to create 5 random samples of 1,000 rows each. However, I do not want any of these samples to contain the same row twice.

To create a random sample I have been using:

import numpy as np
rows = np.random.choice(df.index.values, 1000)
sampled_df = df.loc[rows]

However, just doing this several times would run the risk of having duplicates. Would the best way to handle this be to keep track of which rows have been sampled each time?

Upvotes: 4

Views: 11091

Answers (3)

user2285236

You can use df.sample.

A dataframe with 100 rows and 5 columns:

df = pd.DataFrame(np.random.randn(100, 5), columns = list("abcde"))

Sample 5 rows:

df.sample(5)
Out[8]: 
           a         b         c         d         e
84  0.012201 -0.053014 -0.952495  0.680935  0.006724
45 -1.347292  1.358781 -0.838931 -0.280550 -0.037584
10 -0.487169  0.999899  0.524546 -1.289632 -0.370625
64  1.542704 -0.971672 -1.150900  0.554445 -1.328722
99  0.012143 -2.450915 -0.718519 -1.192069 -1.268863

This ensures those 5 rows are different, because df.sample draws without replacement by default. To repeat the process, I'd suggest sampling number_of_rows * number_of_samples rows in a single call and splitting the result. For example, if each sample is going to contain 5 rows and you need 10 samples, sample 50 rows. The first 5 rows will be the first sample, the next 5 the second, and so on:

all_samples = df.sample(50)
samples = [all_samples.iloc[5*i:5*i+5] for i in range(10)]
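Scaled up to the numbers in the question (5 samples of 1,000 rows from roughly 200,000), the same idea might look like the sketch below. The DataFrame contents, the column names, and the random_state value are placeholders chosen for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Stand-in for the asker's data: 200,000 rows (contents and column names are made up).
df = pd.DataFrame(np.random.randn(200_000, 3), columns=list("abc"))

n_samples = 5
sample_size = 1000

# A single draw without replacement (the default), so all 5,000 rows are distinct.
# random_state is optional and only makes the draw reproducible.
all_samples = df.sample(n=n_samples * sample_size, random_state=0)

# Cut the draw into consecutive chunks of 1,000 rows each.
samples = [
    all_samples.iloc[i * sample_size:(i + 1) * sample_size]
    for i in range(n_samples)
]
```

Because every chunk comes from the same draw, no row can appear in more than one sample.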

Upvotes: 9

lsxliron

Reputation: 540

Take a look at the numpy.random docs.

For your solution:

import numpy as np
rows = np.random.choice(df.index.values, 1000, replace=False)
sampled_df = df.loc[rows]

This will make random choices without replacement.

If you want to generate multiple samples that have no elements in common, you will need to remove the chosen elements from the pool after each iteration. You can use numpy.setdiff1d for that.

import numpy as np
allRows = df.index.values
numOfSamples = 5
samples = list()

for i in range(numOfSamples):
    choices = np.random.choice(allRows, 1000, replace=False)
    samples.append(choices)
    allRows = np.setdiff1d(allRows, choices)

Here is a working example with the integers from 0 to 99:

In [58]: import numpy as np
In [59]: allRows = np.arange(100)
In [60]: numOfSamples = 5
In [61]: samples = list()
In [62]: for i in range(numOfSamples):
   ....:     choices = np.random.choice(allRows, 5, replace=False)
   ....:     samples.append(choices)
   ....:     allRows = np.setdiff1d(allRows, choices)
   ....:

In [63]: samples
Out[63]:
[array([66, 24, 47, 31, 22]),
 array([ 8, 28, 15, 62, 52]),
 array([18, 65, 71, 54, 48]),
 array([59, 88, 43,  7, 85]),
 array([97, 36, 55, 56, 14])]

In [64]: allRows
Out[64]:
array([ 0,  1,  2,  3,  4,  5,  6,  9, 10, 11, 12, 13, 16, 17, 19, 20, 21,
       23, 25, 26, 27, 29, 30, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 44,
       45, 46, 49, 50, 51, 53, 57, 58, 60, 61, 63, 64, 67, 68, 69, 70, 72,
       73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 86, 87, 89, 90, 91,
       92, 93, 94, 95, 96, 98, 99])
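As a possible alternative to the loop (not part of the original answer), the pool can be shuffled once and then sliced, which avoids calling setdiff1d on every iteration. This is only a sketch: it uses NumPy's newer Generator API (np.random.default_rng) rather than np.random.choice, and the names rng, all_rows, and remaining are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

all_rows = np.arange(100)       # stand-in for df.index.values
num_samples = 5
sample_size = 5

# Shuffle the labels once, keep the first num_samples * sample_size of them,
# and arrange them as one row per sample.
shuffled = rng.permutation(all_rows)
samples = shuffled[:num_samples * sample_size].reshape(num_samples, sample_size)

# Labels that were never drawn, corresponding to allRows after the loop above.
remaining = np.sort(shuffled[num_samples * sample_size:])
```

Each row of samples can then be passed to df.loc to pull out the corresponding rows.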

Upvotes: 2

C_Z_

Reputation: 7816

You can set replace to False in np.random.choice:

rows = np.random.choice(df.index.values, 1000, replace=False)
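Note that replace=False only guarantees uniqueness within a single call. Two separate calls can still pick some of the same rows. A small check, using a made-up integer index in place of df.index.values, could look like this:

```python
import numpy as np

index_values = np.arange(200_000)  # stand-in for df.index.values

# Two independent draws, each without replacement.
first = np.random.choice(index_values, 1000, replace=False)
second = np.random.choice(index_values, 1000, replace=False)

# Each draw on its own contains no repeated labels...
within_first_unique = len(np.unique(first)) == len(first)

# ...but the two draws may still share labels with each other.
shared = np.intersect1d(first, second)
print(within_first_unique, len(shared))
```

If shared is non-empty, the samples overlap, so drawing all rows in one call (or removing already chosen rows between calls) is needed for fully disjoint samples.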

Upvotes: 2
