Reputation: 55866
I have a DataFrame
loaded from a .tsv
file. I wanted to generate some exploratory plots. The problem is that the data set is large (~1 million rows), so there are too many points on the plot to see a trend. Plus, it is taking a while to plot.
I wanted to sub-sample 10000 randomly distributed rows. This should be reproducible so the same sequence of random numbers is generated in each run.
The question Sample two pandas dataframes the same way seems to be on the right track, but it does not let me guarantee the subsample size.
Upvotes: 17
Views: 29721
Reputation: 139172
You can select random elements from the index with np.random.choice
. E.g. to select 5 random rows:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10))
df.loc[np.random.choice(df.index, 5, replace=False)]
This function is new in NumPy 1.7. If you want a solution that works with an older numpy, you can shuffle the data and take the first elements of that:
df.loc[np.random.permutation(df.index)[:5]]
In this way your DataFrame is no longer sorted, but if sorted order is needed for plotting (for example, a line plot), you can simply call .sort_index()
afterwards.
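A minimal sketch of the permutation approach followed by re-sorting for a line plot (the toy DataFrame and column names are illustrative, and sort_index is used as the modern equivalent of the deprecated .sort()):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the large DataFrame loaded from the .tsv file
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) ** 2})

# Shuffle the index labels, keep the first 10, then restore index order
sub = df.loc[np.random.permutation(df.index)[:10]].sort_index()
```

The result is a 10-row sub-sample whose index is back in ascending order, so a line plot of it will not zigzag.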
Upvotes: 19
Reputation: 70847
These days, one can simply use the sample
method on a DataFrame:
>>> help(df.sample)
Help on method sample in module pandas.core.generic:
sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) method of pandas.core.frame.DataFrame instance
Returns a random sample of items from an axis of object.
Replicability can be achieved by using the random_state
keyword:
>>> len(set(df.sample(n=1, random_state=np.random.RandomState(0)).iterations.values[0] for _ in range(1000)))
1
>>> len(set(df.sample(n=1).iterations.values[0] for _ in range(1000)))
40
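The fixed-size, reproducible sub-sample the question asks for can be sketched like this (the 50,000-row toy DataFrame and the column name are illustrative; an integer random_state works just as well as a RandomState object):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(50_000)})

# Same random_state -> the same 10,000 rows on every run
s1 = df.sample(n=10_000, random_state=0)
s2 = df.sample(n=10_000, random_state=0)
```

Here s1 and s2 are identical, so an exploratory plot built from the sub-sample is reproducible across runs.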
Upvotes: 14
Reputation: 375485
Unfortunately np.random.choice
appears to be quite slow for small samples (less than 10% of all rows); you may be better off using plain ol' sample:
from random import sample
df.loc[sample(df.index, 1000)]
For a large DataFrame (a million rows), we see that small samples are much faster:
In [11]: %timeit df.loc[sample(df.index, 10)]
1000 loops, best of 3: 1.19 ms per loop
In [12]: %timeit df.loc[np.random.choice(df.index, 10, replace=False)]
1 loops, best of 3: 1.36 s per loop
In [13]: %timeit df.loc[np.random.permutation(df.index)[:10]]
1 loops, best of 3: 1.38 s per loop
In [21]: %timeit df.loc[sample(df.index, 1000)]
10 loops, best of 3: 14.5 ms per loop
In [22]: %timeit df.loc[np.random.choice(df.index, 1000, replace=False)]
1 loops, best of 3: 1.28 s per loop
In [23]: %timeit df.loc[np.random.permutation(df.index)[:1000]]
1 loops, best of 3: 1.3 s per loop
But around 10% it gets about the same:
In [31]: %timeit df.loc[sample(df.index, 100000)]
1 loops, best of 3: 1.63 s per loop
In [32]: %timeit df.loc[np.random.choice(df.index, 100000, replace=False)]
1 loops, best of 3: 1.36 s per loop
In [33]: %timeit df.loc[np.random.permutation(df.index)[:100000]]
1 loops, best of 3: 1.4 s per loop
and if you are sampling everything (don't use sample!):
In [41]: %timeit df.loc[sample(df.index, 1000000)]
1 loops, best of 3: 10 s per loop
Note: both numpy.random and random accept a seed, which lets you reproduce the randomly generated output.
As @joris points out in the comments, choice (without replacement) is actually sugar for permutation, so it's no surprise that it's constant time and slower for smaller samples...
Upvotes: 18