haneulkim

Reputation: 4928

Partition train / test data with np.random.rand

I am trying to understand this line of code:

msk = np.random.rand(len(df)) < 0.8

From my understanding, numpy.random.rand(len(df)) returns an array of len(df) numbers in [0, 1), drawn from the uniform distribution.

What does each number represent in the array? Are the values percentiles of the data?

After doing that, we get an array of booleans, which is then used to create the train and test sets:

train = cdf[msk]
test = cdf[~msk]

In this code, for each column in cdf, is it matching each boolean in the array msk, so that if the value is True the row is put into train, and if it is False the row is put into test?

I want to know if my understanding is correct.

Upvotes: 2

Views: 5206

Answers (2)

Uchenna Nwajideobi

Reputation: 1

np.random.rand(len(df)) => In summary, this generates one random value in [0, 1) for each row of df; these values serve as the selection criterion.

msk = np.random.rand(len(df)) < 0.8 => msk is a Boolean array that is True where the random value is less than 0.8 and False otherwise.

train = cdf[msk] => This returns the rows of the cdf data frame where msk is True.

test = cdf[~msk] => This returns the rows of the cdf data frame where msk is False, thus splitting cdf into training data (approximately 80%) and test data (approximately 20%).
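To make the row selection concrete, here is a minimal sketch that uses a hand-written mask instead of a random one, so the outcome is predictable. The 5-row frame and its column names are hypothetical stand-ins for cdf, not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-row frame standing in for cdf
cdf = pd.DataFrame({"a": [10, 20, 30, 40, 50], "b": list("vwxyz")})

# Hand-written mask (one boolean per row) instead of a random one
msk = np.array([True, False, True, True, False])

train = cdf[msk]   # rows where msk is True, with all columns
test = cdf[~msk]   # rows where msk is False, with all columns

print(train)
print(test)
```

The mask is applied once per row, not once per column: each selected row keeps every column and its original index label.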


Upvotes: 0

jwalton

Reputation: 5686

np.random.rand(len(df)) randomly samples len(df) floating-point numbers from the uniform distribution on [0, 1). Each sampled number is therefore at least 0 and strictly less than 1.

msk is a boolean array.

msk[i] is True if the i-th value randomly generated by np.random.rand is less than (<) 0.8.

msk[i] is False if the i-th value randomly generated by np.random.rand is greater than or equal to (>=) 0.8.
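As a small illustration of the comparison, the sketch below prints each random value next to its mask entry. The seed and the array length of 10 are arbitrary choices made here for repeatability, not part of the original code.

```python
import numpy as np

np.random.seed(0)             # arbitrary seed, only so reruns give the same values
values = np.random.rand(10)   # 10 floats drawn uniformly from [0, 1)
msk = values < 0.8            # element-wise comparison, gives a boolean array

# Show each random value alongside its mask entry
for v, m in zip(values, msk):
    print(f"{v:.3f} -> {m}")
```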

~msk flips True to False and False to True. With this, the rows of cdf where msk is True are assigned to the data frame train, and the rows of cdf where msk is False are assigned to test.

With this set-up you'd expect approximately 80% of cdf to be partitioned into train, and the remaining ~20% to be partitioned into test.
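A quick way to check the "approximately 80%" claim is to run the split on a stand-in data frame and count the rows. This is only a sketch: the 1000-row frame, its column names, and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # arbitrary seed so the split is repeatable

# Hypothetical stand-in for the cdf data frame in the question
cdf = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) * 2})

msk = np.random.rand(len(cdf)) < 0.8   # one True/False per row
train = cdf[msk]                       # rows where msk is True
test = cdf[~msk]                       # rows where msk is False

# Sizes are close to 800 and 200, but not guaranteed to be exact
print(len(train), len(test))
print(len(train) / len(cdf))
```

Because each row is assigned independently with probability 0.8, the train fraction varies from run to run. Every row lands in exactly one of the two sets.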

Upvotes: 6
