R2D2_2024

Reputation: 51

Is numpy's setdiff1d broken?

To select data for training and validation in my machine learning projects, I usually use numpy's masking functionality. So a typical recurring block of code to select the indices for validation and training data looks like this:

import numpy as np

validation_split = 0.2

all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
idxTrain = np.setdiff1d(all_idx, idxValid)

Now the following should always be true:

len(all_idx) == len(idxValid)+len(idxTrain)

Unfortunately, I found out that somehow this is not always the case. As I increase the number of elements chosen from the all_idx array, the resulting numbers do not add up properly. Here is another standalone example, which breaks as soon as the number of randomly chosen validation indices gets large (here 1000):

import numpy as np

all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, 1000)
idxTrain = np.setdiff1d(all_idx, idxValid)

print(len(all_idx), len(idxValid), len(idxTrain))

This results in -> 100000, 1000, 99005

I am confused. Please try it yourself; I would be glad to understand this.

Upvotes: 1

Views: 469

Answers (2)

Giorgos Myrianthous

Reputation: 39840

Consider the following example:

all_idx = np.arange(0, 100)
print(all_idx)
>>> [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]

Now if you print out your validation dataset:

idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
print(idxValid)
>>> [31 57 55 45 26 25 55 76 33 69 49 90 46 14 18 30 89 73 47 82]

You can actually observe that there are duplicates in the resulting set and thus

len(all_idx) == len(idxValid)+len(idxTrain)

won't evaluate to True.
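As a quick sanity check (a small sketch along the same lines, not part of the original example), you can count how many distinct values np.random.choice actually drew; since np.setdiff1d only removes the unique values it finds, the gap in the totals is exactly the number of duplicates:

import numpy as np

all_idx = np.arange(0, 100)
idxValid = np.random.choice(all_idx, 20)        # default: sampling WITH replacement
idxTrain = np.setdiff1d(all_idx, idxValid)      # unique values of all_idx not in idxValid

n_unique = np.unique(idxValid).size             # distinct validation indices actually drawn
print(idxValid.size, n_unique)                  # e.g. 20 18 -> two duplicates were drawn
print(len(idxTrain) == len(all_idx) - n_unique) # True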

What you need to do is make sure that np.random.choice samples without replacement by passing replace=False:

idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)

Now the results should be as expected:

import numpy as np

validation_split = 0.2

all_idx = np.arange(0, 100)
print(all_idx)

idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
print(idxValid)

idxTrain = np.setdiff1d(all_idx, idxValid)
print(idxTrain)

print(len(all_idx) == len(idxValid)+len(idxTrain))

and the output is:

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]

[12 85 96 64 48 21 55 56 80 42 11 92 54 77 49 36 28 31 70 66]

[ 0  1  2  3  4  5  6  7  8  9 10 13 14 15 16 17 18 19 20 22 23 24 25 26
 27 29 30 32 33 34 35 37 38 39 40 41 43 44 45 46 47 50 51 52 53 57 58 59
 60 61 62 63 65 67 68 69 71 72 73 74 75 76 78 79 81 82 83 84 86 87 88 89
 90 91 93 94 95 97 98 99]

True

Consider using train_test_split from scikit-learn, which is straightforward:

from sklearn.model_selection import train_test_split


train, test = train_test_split(df, test_size=0.2)
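Here df stands for the full dataset (e.g. a pandas DataFrame). If you prefer to keep the index-based workflow from the question, train_test_split can also split the index array directly; a minimal sketch:

import numpy as np
from sklearn.model_selection import train_test_split

all_idx = np.arange(0, 100000)

# train_test_split shuffles and partitions the array, so no index appears twice
idxTrain, idxValid = train_test_split(all_idx, test_size=0.2)

print(len(all_idx) == len(idxValid) + len(idxTrain))  # True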

Upvotes: 1

Nuageux

Reputation: 1686

Careful, you need to indicate that you don't want to have duplicates in idxValid. To do so, you just have to add replace=False in np.random.choice:

idxValid = np.random.choice(all_idx, 10, replace=False)

From the documentation:

replace : boolean, optional
    Whether the sample is with or without replacement
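Applied to the standalone example from the question, the counts then add up (a small sketch for illustration):

import numpy as np

all_idx = np.arange(0, 100000)
idxValid = np.random.choice(all_idx, 1000, replace=False)  # no duplicates drawn
idxTrain = np.setdiff1d(all_idx, idxValid)

print(len(all_idx), len(idxValid), len(idxTrain))          # 100000 1000 99000
print(len(all_idx) == len(idxValid) + len(idxTrain))       # True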

Upvotes: 1
