Reputation: 4928
I am trying to understand this line of code:
msk = np.random.rand(len(df)) < 0.8
From my understanding, numpy.random.rand(len(df)) returns an array of numbers in the interval [0, 1), drawn from the uniform distribution.
What does each number represent in the array? Are the values percentiles of the data?
After that comparison, we get an array of booleans, which is then used to create the train and test sets:
train = cdf[msk]
test = cdf[~msk]
In this code, for each column in cdf, is it matching each boolean in the array msk and, if it is True, taking that row and putting it into train, and if it is False, putting it into the test set?
I want to know if my understanding is correct.
Upvotes: 2
Views: 5206
Reputation: 1
np.random.rand(len(df))
=> This generates one random value in [0, 1) for each row of df. Only the length of df is used; the values do not depend on the data in df, so they serve purely as a random selection criterion.
msk = np.random.rand(len(df)) < 0.8
=> msk is a Boolean array that is True where the random value is less than 0.8 and False otherwise.
train = cdf[msk]
=> This returns the rows of the cdf data frame where msk is True,
test = cdf[~msk]
=> This returns the rows of the cdf data frame where msk is False,
thus splitting the data frame cdf into training data (about 80%) and test data (about 20%).
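To illustrate the steps above, here is a minimal sketch. The small data frame standing in for cdf, its column names, and the fixed seed are assumptions made only for this example; the mask is built from len(cdf) so that it has one entry per row of the frame being split.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so the example is repeatable

# Hypothetical stand-in for cdf: 10 rows, 2 columns
cdf = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# One random number per row, compared against 0.8 -> one boolean per row
msk = np.random.rand(len(cdf)) < 0.8

train = cdf[msk]   # rows where msk is True, all columns kept
test = cdf[~msk]   # rows where msk is False, all columns kept

print(train.shape, test.shape)
```

The mask is applied row by row, not column by column: every row ends up in exactly one of train or test, and both keep all the columns of cdf.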
Upvotes: 0
Reputation: 5686
np.random.rand(len(df)) randomly samples len(df) floating point numbers from the uniform distribution on [0, 1). Each number is therefore at least 0 and strictly less than 1.
msk is a boolean array. msk[i] is True if the i-th value generated by np.random.rand is less than 0.8 (<), and msk[i] is False if that value is greater than or equal to 0.8 (>=).
~msk flips True to False and False to True. With this, the rows of cdf where msk is True are assigned to train, and the rows of cdf where msk is False are assigned to test.
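The comparison and the ~ operator can be checked on a small hand-picked array. The values below are assumptions chosen for illustration in place of real output from np.random.rand, so that the result is deterministic.

```python
import numpy as np

# Hand-picked values standing in for np.random.rand output
values = np.array([0.10, 0.80, 0.79, 0.95])

msk = values < 0.8          # elementwise comparison
print(msk.tolist())         # [True, False, True, False]
print((~msk).tolist())      # [False, True, False, True]
```

Note that 0.80 itself gives False, because the comparison is strictly less than.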
With this set-up you'd expect approximately 80% of cdf to be partitioned into train, and the remaining ~20% to be partitioned into test. Because the mask is random, the exact proportions vary from run to run.
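As a rough check of the "approximately 80%" claim, the sketch below measures the fraction of True values in the mask for a few sizes. The sizes and the seed are arbitrary assumptions; the point is only that the fraction can be noticeably off for small inputs and settles near 0.8 as the length grows.

```python
import numpy as np

np.random.seed(42)  # assumption: fixed seed for repeatability

fractions = {}
for n in (10, 1_000, 100_000):
    msk = np.random.rand(n) < 0.8
    # The mean of a boolean array is the fraction of True values
    fractions[n] = msk.mean()
    print(n, fractions[n])
```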
Upvotes: 6