screechOwl

Reputation: 28169

Pandas: create new column in df with random integers from range

I have a pandas data frame with 50k rows. I'm trying to add a new column that is a randomly generated integer from 1 to 5.

If I want 50k random numbers I'd use:

df1['randNumCol'] = random.sample(range(50000), len(df1))

but for this I'm not sure how to do it.

Side note: in R, I'd do:

sample(1:5, 50000, replace = TRUE)

Any suggestions?

Upvotes: 114

Views: 174724

Answers (4)

cottontail

Reputation: 23281

randint is fine for generating small arrays, but for larger arrays NumPy's random Generator methods such as Generator.integers are faster, especially when the range of integers to choose from is large. To use one, construct a generator with numpy.random.default_rng() and call the appropriate method, e.g. integers, choice, normal or standard_normal. The following example generates len(df1) pseudo-random integers between 1 and 4 (the upper bound, 5, is excluded) and assigns them to a column.

import numpy as np

df1['randNumCol'] = np.random.default_rng().integers(1, 5, len(df1))

For a reproducible array of numbers, you can set a random seed in the generator in the same line:

df1['randNumCol'] = np.random.default_rng(2023).integers(1, 5, len(df1))
#                                         ^^^^  <--- set seed here
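
A minimal sketch of the two points above, using a small stand-in frame (the 10-row df1 here is hypothetical, not the 50k-row frame from the question). It assumes the goal is the question's 1 to 5 range: integers excludes the upper bound by default, so either pass 6 as the upper bound or set endpoint=True. It also checks that two generators built from the same seed repeat the same draws.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 50k-row frame in the question.
df1 = pd.DataFrame(index=range(10))

# Upper bound is exclusive by default: values fall in 1..4.
exclusive = np.random.default_rng(2023).integers(1, 5, len(df1))

# endpoint=True makes the upper bound inclusive: values fall in 1..5.
inclusive = np.random.default_rng(2023).integers(1, 5, len(df1), endpoint=True)

# Same seed and same call: the draws repeat exactly.
again = np.random.default_rng(2023).integers(1, 5, len(df1), endpoint=True)

df1['randNumCol'] = inclusive
```

Creating the generator once and reusing it (rather than calling default_rng() on every line) is usually preferable when several columns are drawn in the same script.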

If the range starts from 0 or the values are not consecutive, Generator.choice can be used (it is much faster than the legacy np.random.choice):

# sample from numbers from 0 to 4
rng = np.random.default_rng()
df1['randNumCol'] = rng.choice(5, len(df1))

# sample from the given list
df1['randNumCol'] = rng.choice([1, 2, 4], len(df1))
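
As a rough illustration of what these calls return, the sketch below draws from both forms and, as an optional extra not covered in the answer, passes weights through the p argument of Generator.choice. The sample size n and the weights are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical number of rows

# An integer argument means "sample from 0 .. a-1", here 0..4.
zero_to_four = rng.choice(5, size=n)

# A list argument means "sample from exactly these values".
from_list = rng.choice([1, 2, 4], size=n)

# Optional: non-uniform sampling; p must have one weight per value and sum to 1.
weighted = rng.choice([1, 2, 4], size=n, p=[0.5, 0.3, 0.2])
```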

As the following timeit test shows, Generator.integers takes roughly 38% less time than randint (about 1.6 times the throughput).

import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=range(100_000_000))

%timeit df1['randNumCol'] = np.random.randint(1, 50, len(df1))
# 1.43 s ± 23.3 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

%timeit df1['randNumCol'] = np.random.default_rng().integers(1, 50, len(df1))
# 886 ms ± 31.7 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
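
The %timeit lines above only run in IPython or Jupyter. A comparable check with the standard-library timeit module might look like the sketch below. The array size n is an assumption chosen so that it finishes quickly, and the helper names legacy and modern are invented for the example. Absolute timings will differ from machine to machine, so no expected output is shown.

```python
import timeit

import numpy as np

n = 1_000_000  # far smaller than the 100M rows above, so the check runs quickly

rng = np.random.default_rng()

def legacy():
    # Legacy global-state API.
    return np.random.randint(1, 50, n)

def modern():
    # Generator API.
    return rng.integers(1, 50, n)

# Best of 3 repeats, each repeat timing 5 calls.
t_legacy = min(timeit.repeat(legacy, number=5, repeat=3))
t_modern = min(timeit.repeat(modern, number=5, repeat=3))
print(f"randint: {t_legacy:.3f} s, Generator.integers: {t_modern:.3f} s")
```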

Upvotes: 1

smci

Reputation: 33960

To add a column of random integers, use randint(low, high, size). Unlike random.sample(range(low, high), ...), it does not need range(low, high) to be materialized in memory, which is what range did in Python 2.x and could take a lot of memory if high is large.

df1['randNumCol'] = np.random.randint(0, 5, size=len(df1))  # high is exclusive: draws 0..4; use (1, 6) for 1..5
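
To illustrate the memory point, the sketch below draws from a very wide interval without building it, and then draws the 1 to 5 range the question asks for. The frame size and the bound 10**12 are arbitrary choices for the example, and dtype=np.int64 is passed explicitly on the assumption that the platform's default integer type may be too narrow for that bound.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=range(1000))  # hypothetical stand-in frame

# Only len(df1) integers are allocated, however wide [low, high) is.
wide = np.random.randint(0, 10**12, size=len(df1), dtype=np.int64)

# high is exclusive, so (1, 6) covers 1..5.
df1['randNumCol'] = np.random.randint(1, 6, size=len(df1))
```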

Notes:

Upvotes: 37

shortorian

Reputation: 1182

An option that doesn't require an additional import for numpy:

df1['randNumCol'] = pd.Series(range(1, 6)).sample(len(df1), replace=True).array
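
A short sketch of why the result is unwrapped before assignment, under the assumption of a small stand-in frame: the sampled Series keeps the index labels of the 5-element source Series, so its values are taken out as a plain array, which avoids aligning on those labels. The random_state argument is added here only to make the example repeatable.

```python
import pandas as pd

df1 = pd.DataFrame(index=range(1000))  # hypothetical stand-in frame

# Sample with replacement from the values 1..5; the result keeps labels 0..4.
draws = pd.Series(range(1, 6)).sample(len(df1), replace=True, random_state=0)

# Drop the sampled index so values are assigned by position, not by label.
df1['randNumCol'] = draws.to_numpy()
```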

Upvotes: 5

Matt

Reputation: 17639

One solution is to use numpy.random.randint:

import numpy as np
df1['randNumCol'] = np.random.randint(1, 6, df1.shape[0])

Or, if the numbers are non-consecutive, you can use np.random.choice (albeit slower):

df1['randNumCol'] = np.random.choice([1, 9, 20], df1.shape[0])

To make the results reproducible, you can set the seed with numpy.random.seed (e.g. np.random.seed(42)).
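
A minimal check of the seeding behaviour, with an arbitrary sample size of 10: resetting the seed to the same value before each draw reproduces the draw. Note that np.random.seed sets NumPy's global legacy state, so it affects every later np.random call in the process.

```python
import numpy as np

np.random.seed(42)
first = np.random.randint(1, 6, 10)

# Re-seeding with the same value restarts the same pseudo-random sequence.
np.random.seed(42)
second = np.random.randint(1, 6, 10)

# The same holds for np.random.choice.
np.random.seed(42)
pick_a = np.random.choice([1, 9, 20], 10)
np.random.seed(42)
pick_b = np.random.choice([1, 9, 20], 10)
```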

Upvotes: 170
