Reputation: 2259
I have a pandas dataframe containing ~200,000 rows, and I would like to create 5 random samples of 1000 rows each. However, I do not want any of these samples to contain the same row twice.
To create a random sample I have been using:
import numpy as np
rows = np.random.choice(df.index.values, 1000)
sampled_df = df.loc[rows]
However, just doing this several times runs the risk of producing duplicates. Would the best way to handle this be to keep track of which rows have been sampled each time?
Upvotes: 4
Views: 11091
Reputation:
You can use df.sample.
A dataframe with 100 rows and 5 columns:
df = pd.DataFrame(np.random.randn(100, 5), columns = list("abcde"))
Sample 5 rows:
df.sample(5)
Out[8]:
a b c d e
84 0.012201 -0.053014 -0.952495 0.680935 0.006724
45 -1.347292 1.358781 -0.838931 -0.280550 -0.037584
10 -0.487169 0.999899 0.524546 -1.289632 -0.370625
64 1.542704 -0.971672 -1.150900 0.554445 -1.328722
99 0.012143 -2.450915 -0.718519 -1.192069 -1.268863
By default, df.sample draws without replacement, so those 5 rows are guaranteed to be distinct. If you want to repeat this process without overlap between samples, I'd suggest sampling number_of_rows * number_of_samples rows in a single call. For example, if each sample is going to contain 5 rows and you need 10 samples, sample 50 rows. The first 5 rows will be the first sample, the next 5 will be the second, and so on:
all_samples = df.sample(50)
samples = [all_samples.iloc[5*i:5*i+5] for i in range(10)]
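Applied to the sizes in the question, the same idea might look like the sketch below. The dataframe here is a hypothetical stand-in with random data, and the names n_samples and sample_size are made up for illustration; the final check only confirms that no index label ends up in more than one sample.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ~200,000-row dataframe from the question.
df = pd.DataFrame(np.random.randn(200_000, 5), columns=list("abcde"))

n_samples = 5       # number of non-overlapping samples (illustrative name)
sample_size = 1000  # rows per sample (illustrative name)

# One draw without replacement (the default for df.sample) ...
all_samples = df.sample(n_samples * sample_size, random_state=0)

# ... cut into consecutive, non-overlapping chunks.
samples = [
    all_samples.iloc[i * sample_size:(i + 1) * sample_size]
    for i in range(n_samples)
]

# Sanity check: no index label appears in more than one sample.
combined_index = pd.concat(samples).index
print(combined_index.is_unique)
```

This only works while n_samples * sample_size does not exceed the number of rows, since df.sample cannot draw more rows than exist without replacement.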
Upvotes: 9
Reputation: 540
Take a look at the numpy.random docs.
For your solution:
import numpy as np
rows = np.random.choice(df.index.values, 1000, replace=False)
sampled_df = df.loc[rows]
This will make random choices without replacement.
If you want to generate multiple samples that have no elements in common, you will need to remove the chosen elements from the pool after each iteration. You can use numpy.setdiff1d for that:
import numpy as np
allRows = df.index.values
numOfSamples = 5
samples = list()
for i in range(numOfSamples):  # range replaces Python 2's xrange
    # draw 1000 labels from the rows that are still available
    choices = np.random.choice(allRows, 1000, replace=False)
    samples.append(choices)
    # remove the chosen labels so later samples cannot reuse them
    allRows = np.setdiff1d(allRows, choices)
Here is a working example with the integers from 0 to 99:
In [58]: import numpy as np
In [59]: allRows = np.arange(100)
In [60]: numOfSamples = 5
In [61]: samples = list()
In [62]: for i in range(numOfSamples):
   ....:     choices = np.random.choice(allRows, 5, replace=False)
   ....:     samples.append(choices)
   ....:     allRows = np.setdiff1d(allRows, choices)
   ....:
In [63]: samples
Out[63]:
[array([66, 24, 47, 31, 22]),
array([ 8, 28, 15, 62, 52]),
array([18, 65, 71, 54, 48]),
array([59, 88, 43, 7, 85]),
array([97, 36, 55, 56, 14])]
In [64]: allRows
Out[64]:
array([ 0, 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 16, 17, 19, 20, 21,
23, 25, 26, 27, 29, 30, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 44,
45, 46, 49, 50, 51, 53, 57, 58, 60, 61, 63, 64, 67, 68, 69, 70, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 86, 87, 89, 90, 91,
92, 93, 94, 95, 96, 98, 99])
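To get dataframes back rather than arrays of labels, the loop can be wrapped in a small helper. This is only a sketch: disjoint_samples and its parameters are hypothetical names, the dataframe is random stand-in data, and it assumes the index labels are unique (np.setdiff1d would otherwise collapse duplicates).

```python
import numpy as np
import pandas as pd

def disjoint_samples(df, num_samples, sample_size, seed=None):
    """Hypothetical helper: draw num_samples non-overlapping row samples.

    Assumes df.index contains unique labels.
    """
    rng = np.random.RandomState(seed)
    remaining = df.index.values
    samples = []
    for _ in range(num_samples):
        # Pick labels without replacement from the rows still available.
        chosen = rng.choice(remaining, sample_size, replace=False)
        samples.append(df.loc[chosen])
        # Drop the chosen labels so later draws cannot pick them again.
        remaining = np.setdiff1d(remaining, chosen)
    return samples

# Stand-in data with the size mentioned in the question.
df = pd.DataFrame(np.random.randn(200_000, 3), columns=list("abc"))
samples = disjoint_samples(df, num_samples=5, sample_size=1000, seed=0)
```

Each element of samples is then a 1000-row dataframe, and no row label is shared between any two of them.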
Upvotes: 2
Reputation: 7816
You can set replace to False in np.random.choice:
rows = np.random.choice(df.index.values, 1000, replace=False)
Upvotes: 2