Reputation: 28169

Pandas: create new column in df with random integers from range

I have a pandas data frame with 50k rows. I'm trying to add a new column that is a randomly generated integer from 1 to 5.

If I want 50k random numbers I'd use:

df1['randNumCol'] = random.sample(range(50000), len(df1))

but for this I'm not sure how to do it.

Side note: in R, I'd do:

sample(1:5, 50000, replace = TRUE)

Any suggestions?

Upvotes: 114

Answers (4)

cottontail

Reputation: 23281

randint is fine for generating small arrays, but for larger arrays NumPy's random Generator methods such as Generator.integers are faster, especially when the range of integers to choose from is large. To use one, construct a generator with numpy.random.default_rng() and call the appropriate method, e.g. integers, choice, normal, or standard_normal. In the following example, len(df1) pseudo-random integers between 1 and 4 are generated and assigned to a column (the upper bound, 5, is exclusive; pass 6 as the upper bound to include 5).

import numpy as np

df1['randNumCol'] = np.random.default_rng().integers(1, 5, len(df1))
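Since the question asks for integers from 1 to 5 inclusive, here is a minimal sketch of two ways to include the upper bound. The empty 50k-row df1 and the column name randNumCol2 are stand-ins made up for illustration; endpoint is a keyword argument of Generator.integers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's 50k-row frame.
df1 = pd.DataFrame(index=range(50_000))

rng = np.random.default_rng()

# endpoint=True makes the upper bound inclusive, so values fall in 1..5.
df1['randNumCol'] = rng.integers(1, 5, size=len(df1), endpoint=True)

# Same range using the default exclusive upper bound.
df1['randNumCol2'] = rng.integers(1, 6, size=len(df1))
```

Either form should work; endpoint=True simply keeps the arguments matching the range as it is usually stated.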

For a reproducible array of numbers, you can set a random seed in the generator in the same line:

df1['randNumCol'] = np.random.default_rng(2023).integers(1, 5, len(df1))
#                                         ^^^^  <--- set seed here
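A quick way to check what seeding buys you, as a sketch: two generators built with the same seed should produce identical draws, while an unseeded generator is seeded from the operating system and changes from run to run. The length n is an arbitrary value standing in for len(df1).

```python
import numpy as np

n = 10  # hypothetical length, standing in for len(df1)

# Two independent generators constructed with the same seed.
a = np.random.default_rng(2023).integers(1, 5, n)
b = np.random.default_rng(2023).integers(1, 5, n)

# Unseeded generator: its output is not repeatable across runs.
c = np.random.default_rng().integers(1, 5, n)

same = np.array_equal(a, b)
print(same)
```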

If the range starts from 0 or if the values are not consecutive, Generator.choice can be used (it is much faster than the legacy np.random.choice):

# sample from numbers from 0 to 4
rng = np.random.default_rng()
df1['randNumCol'] = rng.choice(5, len(df1))

# sample from the given list
df1['randNumCol'] = rng.choice([1, 2, 4], len(df1))

As the following timeit test shows, Generator.integers is about 60% faster than randint.

import pandas as pd

df1 = pd.DataFrame(index=range(100_000_000))

%timeit df1['randNumCol'] = np.random.randint(1, 50, len(df1))
# 1.43 s ± 23.3 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

%timeit df1['randNumCol'] = np.random.default_rng().integers(1, 50, len(df1))
# 886 ms ± 31.7 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
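The %timeit lines above need IPython. Outside a notebook, a rough equivalent can be sketched with the standard-library timeit module. The array size n and the repeat count are arbitrary choices made here so the comparison finishes quickly; absolute numbers will vary with the machine and the NumPy version, so no expected timings are shown.

```python
import timeit

import numpy as np

n = 1_000_000  # much smaller than the 100M rows above, to keep this quick
repeats = 5    # arbitrary repeat count

# Legacy global-state API.
legacy = timeit.timeit(lambda: np.random.randint(1, 50, n), number=repeats)

# Generator API; the generator is built once, outside the timed call.
rng = np.random.default_rng()
modern = timeit.timeit(lambda: rng.integers(1, 50, n), number=repeats)

print(f"randint:            {legacy / repeats * 1e3:.1f} ms per call")
print(f"Generator.integers: {modern / repeats * 1e3:.1f} ms per call")
```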

Upvotes: 1

smci

Reputation: 33960

To add a column of random integers, use randint(low, high, size), where high is exclusive. There's no need to waste memory allocating range(low, high), as the question's random.sample(range(...)) approach did in Python 2.x; that could be a lot of memory if high is large.

df1['randNumCol'] = np.random.randint(1, 6, size=len(df1))  # high is exclusive, so this yields 1..5

Notes:

When we're just adding a single column, size is just an integer. In general, if we want to generate an array or dataframe of randint() values, size can be a tuple, as in "Pandas: How to create a data frame of random integers?".
In Python 3.x, range(low, high) no longer allocates a list (potentially using lots of memory); it produces a range object.
Use np.random.seed(...) beforehand for determinism and reproducibility. Note that random.seed(...) only seeds Python's built-in random module and has no effect on np.random.randint.
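The notes on tuple sizes and seeding can be sketched together. np.random.randint draws from NumPy's legacy global generator, so np.random.seed is the call that makes it repeatable. The 4x3 shape, the column names, and the seed value 0 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Seed NumPy's legacy global generator, then draw a 4x3 block of integers in 1..5.
np.random.seed(0)
first = pd.DataFrame(np.random.randint(1, 6, size=(4, 3)), columns=list('abc'))

# Re-seeding with the same value replays the same sequence.
np.random.seed(0)
second = pd.DataFrame(np.random.randint(1, 6, size=(4, 3)), columns=list('abc'))

identical = first.equals(second)
print(identical)
```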

Upvotes: 37

shortorian

Reputation: 1182

An option that doesn't require an additional import for numpy:

df1['randNumCol'] = pd.Series(range(1, 6)).sample(len(df1), replace=True).array
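A reproducible variant of the same pandas-only idea, as a sketch: Series.sample accepts a random_state argument. The empty 50k-row df1 and the seed value 0 are stand-ins for illustration. The sampled Series carries the index labels of the rows it was drawn from, so it is converted to a plain array before assignment to keep the values positional.

```python
import pandas as pd

# Hypothetical stand-in for the asker's 50k-row frame.
df1 = pd.DataFrame(index=range(50_000))

# Draw len(df1) values from 1..5 with replacement; random_state fixes the draw.
sampled = pd.Series(range(1, 6)).sample(len(df1), replace=True, random_state=0)

# to_numpy() drops the sampled index so values are assigned by position.
df1['randNumCol'] = sampled.to_numpy()
```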

Upvotes: 5

Matt

Reputation: 17639

One solution is to use numpy.random.randint:

import numpy as np
df1['randNumCol'] = np.random.randint(1, 6, df1.shape[0])

Or, if the numbers are non-consecutive, you can use numpy.random.choice (albeit slower):

df1['randNumCol'] = np.random.choice([1, 9, 20], df1.shape[0])

To make the results reproducible, set the seed with numpy.random.seed (e.g. np.random.seed(42)) before generating the numbers.

Upvotes: 170
