lmngn23

createDataPartition giving empty test set

I have the following dataset

> northern_sub
  Ratio at Birth Unemployment Graduation Rate Literacy Rate Fertility Rate
1         109.90         1.23           93.25          88.3           2.22
2         110.41         0.87           96.60          89.3           2.21
3         108.20         0.75           99.10          89.2           2.31
4         112.36         0.81           95.93          89.5           2.18
5         116.06         0.77           98.64          89.0           2.56
6         114.25         1.11           93.58          89.9           2.69
7         122.60         1.18           96.28          90.0           2.63
8         117.80         1.02           97.84          89.9           2.53

I create a partition into train and test sets as follows:

training_index <- createDataPartition(northern_sub$`Ratio at Birth`, times = 10, p = 0.7, list = F)
training_set <- northern_sub[training_index,] # Training Set
testing_set <- northern_sub[-training_index,] # Test Set

However, the test set is empty:

> testing_set
[1] Ratio at Birth  Unemployment    Graduation Rate Literacy Rate   Fertility Rate 
<0 rows> (or 0-length row.names)

Is there any way I can fix this issue? Is it because the data frame I collected is too small?

The dataframe structure is

> dput(northern_sub)
structure(list(`Ratio at Birth` = c(109.9, 110.41, 108.2, 112.36, 
116.06, 114.25, 122.6, 117.8), Unemployment = c(1.23, 0.87, 0.75, 
0.81, 0.77, 1.11, 1.18, 1.02), `Graduation Rate` = c(93.25, 96.6, 
99.1, 95.93, 98.64, 93.58, 96.28, 97.84), `Literacy Rate` = c(88.3, 
89.3, 89.2, 89.5, 89, 89.9, 90, 89.9), `Fertility Rate` = c(2.22, 
2.21, 2.31, 2.18, 2.56, 2.69, 2.63, 2.53)), row.names = c(NA, 
-8L), class = "data.frame")

Answers (1)

Kat

The way you called createDataPartition won't work. Setting times = 10 asks for 10 resamples, so with list = F the returned training_index is a matrix with 10 columns, one per resample. Because you have only 8 rows, each column contains all 8 row indices. If you had, say, 100 rows, it would put about 70 row indices in each column, but you still couldn't build the training and testing sets the way you did, because indexing with the whole matrix uses the indices from all 10 columns at once. You have to specify which column of training_index you want to use.

tr_set <- northern_sub[training_index[, 1], ]

Even so, you can't use the remaining rows as the testing data here, because that column already contains all 8 rows and nothing is left over. Whenever you create or modify an object, you should check the result: inspect what you expect.
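
To see why the test set comes back empty, here is a minimal base R sketch that does not need caret. fake_index is a hypothetical stand-in for training_index, assuming (as described above) that each of the 10 columns holds all 8 row numbers. Only two of the question's columns are rebuilt, to keep it short.

```r
# The question's data, rebuilt from the dput() output (two columns are enough here).
northern_sub <- data.frame(
  `Ratio at Birth` = c(109.9, 110.41, 108.2, 112.36, 116.06, 114.25, 122.6, 117.8),
  Unemployment = c(1.23, 0.87, 0.75, 0.81, 0.77, 1.11, 1.18, 1.02),
  check.names = FALSE
)

# Hypothetical stand-in for training_index: 10 columns, each holding row numbers 1 to 8.
fake_index <- matrix(rep(1:8, times = 10), nrow = 8, ncol = 10)

dim(fake_index)                      # 8 rows, 10 columns
sort(unique(as.vector(fake_index)))  # every row number from 1 to 8 appears

# A matrix used as a row index is treated as a plain vector, so the
# negative index drops every row that appears in any column.
leftover <- northern_sub[-fake_index, ]
nrow(leftover)                       # 0
```

Running dim() and nrow() like this right after creating the index is the kind of check that catches the problem early.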

If you had left the times argument out of your call to createDataPartition (and had more data), you could build the training and testing sets the way you did.
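
As a rough sketch of that case, the code below uses a hypothetical data frame of 100 simulated rows (sim_data and its columns are made up for illustration; they are not from the question). It assumes caret is installed. With times left at its default, the index has a single column, so the negative index gives a usable complement.

```r
library(caret)

set.seed(42)  # only so the split is reproducible

# Hypothetical larger data set (100 simulated rows), purely for illustration.
sim_data <- data.frame(y = rnorm(100), x = runif(100))

# No `times` argument, so a single resample is returned;
# list = FALSE gives a one-column matrix of row indices.
sim_index <- createDataPartition(sim_data$y, p = 0.7, list = FALSE)

sim_train <- sim_data[sim_index, ]   # rows selected for training
sim_test  <- sim_data[-sim_index, ]  # everything else

# Inspect what you expect: one column, and a non-empty test set.
dim(sim_index)
nrow(sim_train)
nrow(sim_test)
```

The exact split sizes depend on how createDataPartition groups the outcome, so check them rather than assuming an exact 70/30 split.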

Since all of your columns are numeric, you could generate additional data with something like a generative adversarial network.
