Reputation: 181
I am trying to split a dataset 80/20 into training and testing sets. I want to split by location, which is a factor with 4 levels, but the levels have not been sampled equally. Out of 1892 samples:
Location1: 172
Location2: 615
Location3: 603
Location4: 502
I want to split the whole dataset 80/20, as mentioned above, but I also want each location to be split 80/20, so that every location is represented in the same proportion in the training and testing sets. I've seen one post that does this with the stratified function from the splitstackshape package, but it doesn't seem to want to split my factors up.
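As a rough illustration of the target, the per-location row counts for an 80/20 split can be worked out from the sample counts above. This is only a sketch: rounding each location's share with round() is an assumption, and a different rounding rule would shift a row or two between the sets.

```r
# Sample counts per location, as given above
n <- c(Location1 = 172, Location2 = 615, Location3 = 603, Location4 = 502)

# Assumption: round each location's 80% share to a whole number of rows
n_train <- round(0.8 * n)
n_test  <- n - n_train

# Side-by-side view of the counts per location
rbind(total = n, train = n_train, test = n_test)

# Overall training fraction, which should stay close to 0.8
sum(n_train) / sum(n)
```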
Here is a simplified reproducible example -
library(splitstackshape)
x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)
validIndex <- stratified(df, "xx", size=16/nrow(df))
valid <- df[-validIndex,]
train <- df[validIndex,]
where A, B, C, and D correspond to the factor levels, in approximately the same proportions as the actual dataset (~10, 32, 32, and 26%, respectively)
Upvotes: 1
Views: 994
Reputation: 1754
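Using bothSets = TRUE should return a list containing the split of the original data frame into the sampled rows and the remaining rows (whose union should be the original data frame). Since size is 16/nrow(df) = 0.8, the first element is the 80% sample (training set) and the second is the remaining 20% (validation set):
library(splitstackshape)
splt <- stratified(df, "xx", size = 16/nrow(df), replace = FALSE, bothSets = TRUE)
train <- splt[[1]]  # sampled rows, about 80% of each level of xx
valid <- splt[[2]]  # remaining rows
## check: stacking both sets should reproduce the original rows
df2 <- as.data.frame(do.call("rbind", splt))
all.equal(df[with(df, order(xx, x)), ],
          df2[with(df2, order(xx, x)), ],
          check.names = FALSE)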
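If splitstackshape is not an option, the same kind of split can be sketched in base R by sampling row indices within each level of xx. This is a minimal sketch rather than the package's implementation: the helper name stratified_split_idx, the seed, and the use of round() to choose the per-level count are all assumptions made for illustration.

```r
# Toy data from the question
x  <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C",
        "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)

# Hypothetical helper: returns row indices for the training set,
# drawing round(prop * n) rows from each level of `group`.
stratified_split_idx <- function(group, prop) {
  idx_by_level <- split(seq_along(group), group)
  picked <- lapply(idx_by_level, function(idx) {
    # sample positions, not values, so single-row levels behave correctly
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)  # arbitrary seed, only for reproducibility
train_idx <- stratified_split_idx(df$xx, prop = 0.8)
train <- df[train_idx, ]
valid <- df[-train_idx, ]

# Per-level row counts in each set
table(train$xx)
table(valid$xx)
```

Because this returns indices, negative indexing with df[-train_idx, ] works here, unlike with the data frame that stratified returns. With the toy data, the training set gets 2, 6, 5, and 4 rows for A to D, so a level with only two rows, like A, ends up entirely in the training set; levels that small may need separate handling.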
Upvotes: 1