Reputation: 89
I need to split my data set randomly into training, validation and test sets, as shown in this post (R: How to split a data frame into training, validation, and test sets?), but the split needs to be done randomly by subject ID rather than over the rows of the whole data frame.
When I apply the code from the answer to that question, it splits the rows of my data frame completely at random. My data has several stacked rows per ID, and these need to stay together; otherwise one subject's data will be spread across the different sets.
Sorry if this sounds a bit confusing. Here is my data to illustrate the issue:
df <- data.frame(Contact.ID, Date.Time, Age, Gender, Attendance)
Contact.ID Date.Time Age Gender Attendance
1 A 2012-07-06 18:54:48 37 Male 30
2 A 2012-07-06 20:50:18 37 Male 30
3 A 2012-08-14 20:18:44 37 Male 30
4 B 2012-03-15 16:58:15 27 Female 40
5 B 2012-04-18 10:57:02 27 Female 40
6 B 2012-04-18 17:31:22 27 Female 40
7 B 2012-04-18 18:37:00 27 Female 40
8 C 2013-10-22 17:46:07 40 Male 5
9 C 2013-10-27 11:21:00 40 Male 5
10 D 2012-07-28 14:48:33 20 Female 12
If I split this data randomly by row, subject A's entries could, for instance, end up with two in my test set and one in my validation set. What I need is a random split of the IDs, not a random split of the rows of the whole data frame, and I cannot figure out how to connect the two.
Upvotes: 3
Views: 5960
Reputation: 44330
The code you posted from the previous train/validate/test question assigns a train, validate, or test label to each row of a data frame and then splits based on the label of each row:
spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
  seq(nrow(df)),
  nrow(df)*cumsum(c(0,spec)),
  labels = names(spec)
))
res = split(df, g)
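To make the problem concrete, here is a small sketch that applies this row-level split to the ID column of your example data and counts how many different sets each subject's rows land in. The name `sets_per_id` is introduced here for illustration only, and the data frame is cut down to the one column that matters.

```r
# Only the ID column of the example data is needed to show the effect
df <- data.frame(
  Contact.ID = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D")
)

spec <- c(train = .6, test = .2, validate = .2)

# One label per row, shuffled: this is the row-level split
g <- sample(cut(
  seq(nrow(df)),
  nrow(df) * cumsum(c(0, spec)),
  labels = names(spec)
))

# For each subject, count the distinct sets its rows were assigned to
sets_per_id <- tapply(as.character(g), df$Contact.ID,
                      function(x) length(unique(x)))
sets_per_id
```

Any value above 1 means that subject's rows were scattered over several sets. With these ten rows that always happens for at least one subject, because sets of 6, 2 and 2 rows cannot be filled with whole subjects of 3, 4, 2 and 1 rows.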
Instead, you could assign a label to each unique level of your ID factor variable and split based on the label assigned to the ID of each row:
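set.seed(144)  # for a reproducible shuffle
spec = c(train = .6, test = .2, validate = .2)
# One shuffled label per unique ID instead of one per row
g = sample(cut(
  seq_along(unique(df$Contact.ID)),
  length(unique(df$Contact.ID))*cumsum(c(0,spec)),
  labels = names(spec)
))
# Look up each row's label through the integer code of its ID
(res = split(df, g[as.factor(df$Contact.ID)]))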
# $train
# Contact.ID Date.Time Age Gender Attendance
# 1 A 2012-07-06 18:54:48 37 Male 30
# 2 A 2012-07-06 20:50:18 37 Male 30
# 3 A 2012-08-14 20:18:44 37 Male 30
# 8 C 2013-10-22 17:46:07 40 Male 5
# 9 C 2013-10-27 11:21:00 40 Male 5
#
# $test
# Contact.ID Date.Time Age Gender Attendance
# 4 B 2012-03-15 16:58:15 27 Female 40
# 5 B 2012-04-18 10:57:02 27 Female 40
# 6 B 2012-04-18 17:31:22 27 Female 40
# 7 B 2012-04-18 18:37:00 27 Female 40
#
# $validate
# Contact.ID Date.Time Age Gender Attendance
# 10 D 2012-07-28 14:48:33 20 Female 12
Note that this changes the interpretation of the split proportions: the 60% assigned to the training set is now 60% of the unique subject IDs, not 60% of the rows.
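If you want to see how far the row-level proportions drift from `spec`, and confirm that no subject is shared between sets, a check along the following lines may help. It is only a sketch: it rebuilds a reduced version of your example data, uses `match()` against the unique IDs for the lookup, and the names `ids`, `sets_per_id`, `id_share` and `row_share` are introduced here for illustration.

```r
# Reduced version of the example data
df <- data.frame(
  Contact.ID = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D"),
  Attendance = c(30, 30, 30, 40, 40, 40, 40, 5, 5, 12)
)

spec <- c(train = .6, test = .2, validate = .2)
ids <- unique(df$Contact.ID)

# One shuffled label per unique ID
g <- sample(cut(
  seq_along(ids),
  length(ids) * cumsum(c(0, spec)),
  labels = names(spec)
))

# Give every row the label of its ID, then split
res <- split(df, g[match(df$Contact.ID, ids)])

# Each ID should appear in exactly one of the three sets
sets_per_id <- table(unlist(lapply(
  res, function(d) as.character(unique(d$Contact.ID))
)))
stopifnot(all(sets_per_id == 1))

# Share of IDs versus share of rows in each set
id_share  <- sapply(res, function(d) length(unique(d$Contact.ID))) / length(ids)
row_share <- sapply(res, nrow) / nrow(df)
rbind(id_share, row_share)
```

With only four IDs the 60/20/20 spec can only be approximated as 2, 1 and 1 IDs, and the row shares depend on which subjects happen to land in which set.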
Upvotes: 4