Reputation: 28169
I have a pandas data frame with 50k rows. I'm trying to add a new column that is a randomly generated integer from 1 to 5.
If I want 50k random numbers I'd use:
df1['randNumCol'] = random.sample(range(50000), len(df1))
but for this I'm not sure how to do it.
Side note: in R, I'd do:
sample(1:5, 50000, replace = TRUE)
Any suggestions?
Upvotes: 114
Views: 174724
Reputation: 23281
randint is fine for generating small arrays, but for larger arrays NumPy's random Generator methods such as Generator.integers are faster, especially if the range of integers to choose from is large. To use one, construct a generator with numpy.random.default_rng() and call the appropriate method, e.g. integers, choice, normal, standard_normal, etc. In the following example, len(df1) pseudo-random integers between 1 and 4 (the upper bound 5 is exclusive) are generated and assigned to a column.
import numpy as np
df1['randNumCol'] = np.random.default_rng().integers(1, 5, len(df1))
For a reproducible array of numbers, you can set a random seed in the generator in the same line:
df1['randNumCol'] = np.random.default_rng(2023).integers(1, 5, len(df1))  # 2023 is the seed
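Since the question asks for integers from 1 to 5 inclusive, here is a minimal sketch of one way to get that with a seeded generator, using the endpoint=True argument of Generator.integers. The 50,000-row df1 built here is only a stand-in for the asker's frame, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 50k-row frame from the question.
df1 = pd.DataFrame(index=range(50_000))

rng = np.random.default_rng(2023)  # seeded so the column is reproducible
# endpoint=True makes the upper bound inclusive, so values fall in 1..5.
df1['randNumCol'] = rng.integers(1, 5, size=len(df1), endpoint=True)

# A fresh generator with the same seed reproduces the same draws.
again = np.random.default_rng(2023).integers(1, 5, size=len(df1), endpoint=True)
print(df1['randNumCol'].min(), df1['randNumCol'].max())
```

Passing high=6 with the default endpoint=False should give the same range; endpoint=True just keeps the bounds reading the way the question states them.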
If the range starts from 0 or if the values are not consecutive, then Generator.choice can be used (and it is much faster than the legacy np.random.choice):
# sample from numbers from 0 to 4
rng = np.random.default_rng()
df1['randNumCol'] = rng.choice(5, len(df1))
# sample from the given list
df1['randNumCol'] = rng.choice([1, 2, 4], len(df1))
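As a self-contained sketch of the Generator.choice calls above: the small df1, the seed, the weightedCol column name, and the weights are all made up for illustration. The p argument is optional and only shown because non-uniform sampling is a common follow-up.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=range(1_000))  # hypothetical small frame
rng = np.random.default_rng(0)          # arbitrary seed

allowed = [1, 2, 4]  # non-consecutive values to sample from
# Sampling is with replacement by default.
df1['randNumCol'] = rng.choice(allowed, size=len(df1))

# Optional weights: p needs one entry per value and must sum to 1.
df1['weightedCol'] = rng.choice(allowed, size=len(df1), p=[0.5, 0.25, 0.25])
```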
As the following timeit test shows, Generator.integers is about 60% faster than randint.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(index=range(100_000_000))
%timeit df1['randNumCol'] = np.random.randint(1, 50, len(df1))
# 1.43 s ± 23.3 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
%timeit df1['randNumCol'] = np.random.default_rng().integers(1, 50, len(df1))
# 886 ms ± 31.7 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
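%timeit only works in IPython/Jupyter. If you want to repeat the comparison in a plain script, something like the following sketch with the standard-library timeit module should do. The row count is deliberately smaller than the 100M above, the helper names use_randint and use_generator are made up, and the timings will of course vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000  # hypothetical size, much smaller than the 100M rows above
df1 = pd.DataFrame(index=range(n))

def use_randint():
    # legacy global-state API
    df1['randNumCol'] = np.random.randint(1, 50, len(df1))

def use_generator():
    # newer Generator API
    df1['randNumCol'] = np.random.default_rng().integers(1, 50, len(df1))

calls = 3
# Take the best of several repeats to reduce noise from other processes.
t_randint = min(timeit.repeat(use_randint, number=calls, repeat=3))
t_generator = min(timeit.repeat(use_generator, number=calls, repeat=3))

print(f"randint:            {t_randint / calls:.4f} s per call")
print(f"Generator.integers: {t_generator / calls:.4f} s per call")
```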
Upvotes: 1
Reputation: 33960
To add a column of random integers, use randint(low, high, size). There's no need to waste memory materializing range(low, high) as a list, which is what it did in Python 2.x; that could be a lot of memory if high is large.
import numpy as np
df1['randNumCol'] = np.random.randint(1, 6, size=len(df1))  # high is exclusive, so this yields 1..5
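To illustrate the size parameter of randint, here is a rough sketch that passes it once as an integer (one column) and once as a tuple (a whole block of columns). The 10-row frame, the block name, and the column names a, b, c are invented for the example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=range(10))  # hypothetical tiny frame

# size as an integer: one column of values in 1..5 (high=6 is exclusive)
df1['randNumCol'] = np.random.randint(1, 6, size=len(df1))

# size as a tuple: a 10x3 block of random integers generated in one call
block = pd.DataFrame(np.random.randint(1, 6, size=(10, 3)), columns=list('abc'))
print(block.shape)
```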
Notes:
Here, size is just an integer. In general, if we want to generate an array or DataFrame of random integers, size can be a tuple, as in "Pandas: How to create a data frame of random integers?".
In Python 3.x, range(low, high) no longer allocates a list (potentially using lots of memory); it produces a range() object.
Call np.random.seed(...) beforehand, for determinism and reproducibility (the standard library's random.seed(...) does not affect NumPy's generator).
Upvotes: 37
Reputation: 1182
An option that doesn't require an additional import for numpy:
df1['randNumCol'] = pd.Series(range(1,6)).sample(int(5e4), replace=True).array
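A slightly more general sketch of the same idea, assuming you would rather size the sample from the frame itself than hard-code 5e4. The random_state value is arbitrary and only there to make the draw repeatable, and df1 is a stand-in frame.

```python
import pandas as pd

df1 = pd.DataFrame(index=range(50_000))  # hypothetical stand-in frame

values = pd.Series(range(1, 6))  # the candidate values 1..5
# random_state makes the draw reproducible; .to_numpy() drops the sampled
# index so the assignment is positional rather than index-aligned.
df1['randNumCol'] = values.sample(len(df1), replace=True, random_state=42).to_numpy()
```

Without the .to_numpy() (or .array) step, pandas would align on the sampled index, which only contains 0..4, and most rows would end up NaN.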
Upvotes: 5
Reputation: 17639
One solution is to use numpy.random.randint:
import numpy as np
df1['randNumCol'] = np.random.randint(1, 6, df1.shape[0])
Or if the numbers are non-consecutive (albeit slower), you can use this:
df1['randNumCol'] = np.random.choice([1, 9, 20], df1.shape[0])
In order to make the results reproducible, you can set the seed with numpy.random.seed (e.g. np.random.seed(42)).
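A quick sketch of what seeding buys you, assuming a small stand-in df1: resetting the same seed before each call makes randint return the same sequence.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=range(1_000))  # hypothetical stand-in frame

np.random.seed(42)  # seeds NumPy's legacy global generator
first = np.random.randint(1, 6, df1.shape[0])

np.random.seed(42)  # same seed again, so the same sequence is drawn
second = np.random.randint(1, 6, df1.shape[0])

df1['randNumCol'] = first
print((first == second).all())
```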
Upvotes: 170