Reputation: 521
I have the following dataset
> northern_sub
  Ratio at Birth Unemployment Graduation Rate Literacy Rate Fertility Rate
1         109.90         1.23           93.25          88.3           2.22
2         110.41         0.87           96.60          89.3           2.21
3         108.20         0.75           99.10          89.2           2.31
4         112.36         0.81           95.93          89.5           2.18
5         116.06         0.77           98.64          89.0           2.56
6         114.25         1.11           93.58          89.9           2.69
7         122.60         1.18           96.28          90.0           2.63
8         117.80         1.02           97.84          89.9           2.53
I create a partition into train and test sets as follows:
training_index <- createDataPartition(northern_sub$`Ratio at Birth`, times = 10, p = 0.7, list = F)
training_set <- northern_sub[training_index,] # Training Set
testing_set <- northern_sub[-training_index,] # Test Set
However, the test set is empty:
> testing_set
[1] Ratio at Birth Unemployment Graduation Rate Literacy Rate Fertility Rate
<0 rows> (or 0-length row.names)
Is there any way I can fix this issue? Is it because my data frame is too small?
The dataframe structure is
> dput(northern_sub)
structure(list(`Ratio at Birth` = c(109.9, 110.41, 108.2, 112.36,
116.06, 114.25, 122.6, 117.8), Unemployment = c(1.23, 0.87, 0.75,
0.81, 0.77, 1.11, 1.18, 1.02), `Graduation Rate` = c(93.25, 96.6,
99.1, 95.93, 98.64, 93.58, 96.28, 97.84), `Literacy Rate` = c(88.3,
89.3, 89.2, 89.5, 89, 89.9, 90, 89.9), `Fertility Rate` = c(2.22,
2.21, 2.31, 2.18, 2.56, 2.69, 2.63, 2.53)), row.names = c(NA,
-8L), class = "data.frame")
Upvotes: 1
Views: 40
Reputation: 18714
The way you called createDataPartition won't work here. Setting times = 10 asks for 10 resamples, so with list = F the result is a matrix with 10 columns, one set of training indices per column. Because you have only 8 rows, every column contains all 8 row indices, so -training_index drops every row and the test set is empty. If you had, say, 100 rows, each column would hold roughly 70 row indices, but you still couldn't build the training and testing sets the way you did. You have to specify which column of training_index you want to use.
tr_set <- northern_sub[training_index[, 1], ]
You still can't use the complement of this as the testing data, because that column contains all 8 rows. Whenever you create or modify an object, you should check the result. Inspect what you expect.
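To make the empty test set concrete, here is a small base-R sketch that does not call caret. The matrix idx is a hand-built stand-in for what the call in the question returned (10 columns, each holding all 8 row indices), and df is a one-column placeholder for northern_sub; both names are assumptions for illustration only.

```r
# Placeholder for northern_sub: only the first column, values from the question.
df <- data.frame(
  ratio_at_birth = c(109.90, 110.41, 108.20, 112.36, 116.06, 114.25, 122.60, 117.80)
)

# Hand-built stand-in for the returned index matrix (an assumption, not caret
# output): 10 resamples as columns, each containing every row index 1..8.
idx <- matrix(rep(1:8, times = 10), nrow = 8, ncol = 10)

dim(idx)  # 8 rows, 10 columns

# Picking one column gives a usable training set ...
train <- df[idx[, 1], , drop = FALSE]

# ... but its complement is empty, because the column lists every row.
test <- df[-idx[, 1], , drop = FALSE]

nrow(train)
nrow(test)
```

Checking dim(idx) and nrow(test) right after creating them is the kind of inspection that would have exposed the problem before any modelling.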
If you had left the times parameter out of your call to createDataPartition (and had more data), you could build the training and testing sets the way you did.
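For comparison, a single split can also be sketched with plain random sampling in base R. This is a swapped-in technique, not createDataPartition: it does no stratification on the outcome, which is why it still leaves rows for testing with only 8 observations. The data frame northern_small and its column names are assumptions (two columns copied from the question), and the seed is arbitrary.

```r
# Two columns copied from the question's dput() output; the others are omitted.
northern_small <- data.frame(
  ratio_at_birth = c(109.90, 110.41, 108.20, 112.36, 116.06, 114.25, 122.60, 117.80),
  fertility_rate = c(2.22, 2.21, 2.31, 2.18, 2.56, 2.69, 2.63, 2.53)
)

set.seed(42)  # arbitrary seed, only so the split is reproducible

n <- nrow(northern_small)

# One set of row indices: 70% of the rows, rounded down (5 of 8 here).
train_rows <- sample(seq_len(n), size = floor(0.7 * n))

training_set <- northern_small[train_rows, ]   # 5 rows
testing_set  <- northern_small[-train_rows, ]  # the remaining 3 rows

nrow(training_set)
nrow(testing_set)
```

A 3-row test set says very little about model quality, so this only shows the mechanics of a single, non-empty split.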
Since all of your columns are numeric, you could generate additional synthetic data with something like a generative adversarial network.
Upvotes: 1