Fee

Reputation: 89

How to split a data frame into training, validation, and test sets by subject ID?

I need a randomised split of my data set into training, validation, and test sets, as shown in this post (R: How to split a data frame into training, validation, and test sets?), but the randomisation needs to happen at the level of subject IDs, not individual rows of the data frame.

When I apply the code from the answer to that question, it splits my data frame row by row at random. My data has several stacked rows per ID, and those rows need to stay together; otherwise one subject's data will be distributed across the different sets.

Sorry if this sounds a bit confusing. Here is my data to explain the issue:

df <- data.frame(Contact.ID, Date.Time, Age, Gender, Attendance)

Contact.ID       Date.Time       Age   Gender   Attendance   
1   A       2012-07-06 18:54:48   37    Male         30    
2   A       2012-07-06 20:50:18   37    Male         30    
3   A       2012-08-14 20:18:44   37    Male         30   
4   B       2012-03-15 16:58:15   27  Female         40    
5   B       2012-04-18 10:57:02   27  Female         40    
6   B       2012-04-18 17:31:22   27  Female         40    
7   B       2012-04-18 18:37:00   27  Female         40    
8   C       2013-10-22 17:46:07   40    Male         5    
9   C       2013-10-27 11:21:00   40    Male         5    
10  D       2012-07-28 14:48:33   20  Female         12 

If I split this data randomly by row, subject A's entries could, for instance, end up with two in my test set and one in my validation set. What I need is a random split of the distinct IDs, not of the individual rows, and I cannot figure out how to connect the two.

Upvotes: 3

Views: 5960

Answers (1)

josliber

Reputation: 44330

The code you posted from the previous train/validate/test question assigns a train, validate, or test label to each row of a data frame and then splits based on the label of each row:

spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
  seq(nrow(df)), 
  nrow(df)*cumsum(c(0,spec)),
  labels = names(spec)
))
res = split(df, g)
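
To see why the row-level split above is a problem for stacked IDs, here is a minimal sketch that counts how many different sets each ID lands in. The toy data frame only mirrors the layout of the question's data, and the seed and the name `sets_per_id` are my own assumptions for illustration:

```r
# Toy data mirroring the question's layout (illustrative values only)
df <- data.frame(
  Contact.ID = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D"),
  Age        = c(37, 37, 37, 27, 27, 27, 27, 40, 40, 20)
)

set.seed(1)  # arbitrary seed, chosen only for reproducibility
spec <- c(train = .6, test = .2, validate = .2)

# Row-level labels, exactly as in the code above
g <- sample(cut(
  seq(nrow(df)),
  nrow(df) * cumsum(c(0, spec)),
  labels = names(spec)
))

# Number of distinct sets each ID ends up in; any value above 1
# means that subject's rows were scattered across sets
sets_per_id <- tapply(g, df$Contact.ID, function(x) length(unique(x)))
sets_per_id
```

Any ID with a count above 1 has rows in more than one set, which is the leakage the question describes.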

Instead, you could assign a label to each unique level of your ID factor variable and split based on the label assigned to the ID of each row:

set.seed(144)
spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
  seq_along(unique(df$Contact.ID)), 
  length(unique(df$Contact.ID))*cumsum(c(0,spec)),
  labels = names(spec)
))
(res = split(df, g[as.factor(df$Contact.ID)]))
# $train
#   Contact.ID          Date.Time Age Gender Attendance
# 1          A 2012-07-0618:54:48  37   Male         30
# 2          A 2012-07-0620:50:18  37   Male         30
# 3          A 2012-08-1420:18:44  37   Male         30
# 8          C 2013-10-2217:46:07  40   Male          5
# 9          C 2013-10-2711:21:00  40   Male          5
# 
# $test
#   Contact.ID          Date.Time Age Gender Attendance
# 4          B 2012-03-1516:58:15  27 Female         40
# 5          B 2012-04-1810:57:02  27 Female         40
# 6          B 2012-04-1817:31:22  27 Female         40
# 7          B 2012-04-1818:37:00  27 Female         40
# 
# $validate
#    Contact.ID          Date.Time Age Gender Attendance
# 10          D 2012-07-2814:48:33  20 Female         12

Note that this changes the interpretation of the split proportions: the 60% assigned to the training set is now 60% of the unique subject IDs, not 60% of the rows.
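
If you want to confirm both points (every ID sits in exactly one set, and the row proportions differ from the ID proportions), something along these lines should work. This is a sketch under assumptions: the toy data only mimics the question's layout, and `ids`, `id_to_set`, `row_set`, `id_share`, and `row_share` are names I made up for illustration. It also uses an explicit named lookup from ID to label, which is a variant of the factor indexing above rather than the same code:

```r
# Toy data mirroring the question's layout (illustrative values only)
df <- data.frame(
  Contact.ID = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D"),
  Attendance = c(30, 30, 30, 40, 40, 40, 40, 5, 5, 12)
)

set.seed(144)
spec <- c(train = .6, test = .2, validate = .2)
ids <- unique(df$Contact.ID)

# One label per unique ID, shuffled
g <- sample(cut(
  seq_along(ids),
  length(ids) * cumsum(c(0, spec)),
  labels = names(spec)
))

# Explicit ID -> label lookup, then one label per row
id_to_set <- setNames(g, ids)
row_set <- id_to_set[as.character(df$Contact.ID)]
res <- split(df, row_set)

# Check 1: every ID appears in exactly one set
sets_per_id <- tapply(as.character(row_set), df$Contact.ID,
                      function(x) length(unique(x)))
stopifnot(all(sets_per_id == 1))

# Check 2: share of IDs per set versus share of rows per set
id_share  <- prop.table(table(g))
row_share <- prop.table(table(row_set))
id_share
row_share
```

With only a handful of IDs the shares are necessarily coarse, and the row shares can drift a long way from `spec` when some subjects have many more rows than others.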

Upvotes: 4
