Reputation: 121
I have this data frame:
df = structure(list(session_id = c(1105L, 1105L, 1105L, 1107L, 1107L,
1107L, 1108L, 1108L, 1108L, 1109L, 1109L, 1109L, 1110L, 1110L,
1110L, 1111L, 1111L, 1111L, 1111L, 1112L, 1112L, 1112L, 1112L,
1114L, 1114L, 1114L, 1114L), datetime = structure(c(1457483622,
1457483623, 1457483625, 1457484264, 1457484266, 1457484269, 1457484842,
1457484844, 1457484846, 1457485297, 1457485299, 1457485300, 1457485369,
1457485369, 1457485371, 1457486315, 1457486316, 1457486316, 1457486318,
1457486477, 1457486480, 1457486480, 1457486481, 1457486997, 1457486997,
1457486998, 1457487001), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
request = c(8, 3, 3, 14, 14, 7, 9, 10, 10, 17, 6, 6, 10,
8, 5, 9, 11, 14, 16, 21, 11, 1, 19, 7, 4, 13, 20)), .Names = c("session_id",
"datetime", "request"), row.names = c(NA, -27L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
I am trying to group this data by session_id so that 50% of the requests in each session go into a training set (Train) and the remaining 50% go into a test set (Test).
For example, session_id = 1105 contains 3 entries, so half of it (50%) is 1.5, which we round up to 2 (the next integer). The Train column then contains 8, 3 and the Test column contains 3. The same should be done for the remaining session_ids.
Upvotes: 0
Views: 44
Reputation: 39174
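We can use the slice function from the dplyr package. slice(1:round(n() * 0.5)) keeps the first 50% of the rows in each group. After creating df_train, we can then use anti_join to create df_test from the remaining rows. Note that round() in R rounds halves to the even number, so round(1.5) is 2 but round(2.5) is also 2; for a group of 5 rows, ceiling() would be needed to always round up as described in the question.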
library(dplyr)
# Create ID by row and group data by session_id
df <- df %>%
  mutate(ID = 1:n()) %>%
  group_by(session_id)
# Take the first 50% sample of each group
df_train <- df %>%
  slice(1:round(n() * 0.5)) %>%
  ungroup()
# Filter out those records
df_test <- df %>%
  anti_join(df_train, by = "ID") %>%
  ungroup()
head(df_train)
# # A tibble: 6 x 4
# session_id datetime request ID
# <int> <dttm> <dbl> <int>
# 1 1105 2016-03-09 00:33:42 8 1
# 2 1105 2016-03-09 00:33:43 3 2
# 3 1107 2016-03-09 00:44:24 14 4
# 4 1107 2016-03-09 00:44:26 14 5
# 5 1108 2016-03-09 00:54:02 9 7
# 6 1108 2016-03-09 00:54:04 10 8
head(df_test)
# # A tibble: 6 x 4
# session_id datetime request ID
# <int> <dttm> <dbl> <int>
# 1 1105 2016-03-09 00:33:45 3 3
# 2 1107 2016-03-09 00:44:29 7 6
# 3 1108 2016-03-09 00:54:06 10 9
# 4 1109 2016-03-09 01:01:40 6 12
# 5 1110 2016-03-09 01:02:51 5 15
# 6 1111 2016-03-09 01:18:36 14 18
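If the training half must always round up, as the question describes (1.5 becomes 2), a variant based on ceiling() may be closer to what is wanted. The following is only a sketch under some assumptions: toy is a small illustrative data frame rather than the question's df, session 1200 is a hypothetical 5-row group added to show the odd-sized case, and is_train is a helper column name introduced here.

```r
library(dplyr)

# Toy data standing in for the question's df; session 1200 is hypothetical
toy <- data.frame(
  session_id = c(1105L, 1105L, 1105L,
                 1111L, 1111L, 1111L, 1111L,
                 1200L, 1200L, 1200L, 1200L, 1200L),
  request    = c(8, 3, 3, 9, 11, 14, 16, 1, 2, 3, 4, 5)
)

# Flag the first ceiling(n / 2) rows of each session as training rows
flagged <- toy %>%
  group_by(session_id) %>%
  mutate(is_train = row_number() <= ceiling(n() * 0.5)) %>%
  ungroup()

# Split on the flag, then drop the helper column
toy_train <- flagged %>% filter(is_train) %>% select(-is_train)
toy_test  <- flagged %>% filter(!is_train) %>% select(-is_train)

# Rows per session in each split
count(toy_train, session_id)
count(toy_test, session_id)
```

Because the split is driven by a logical flag inside each group, no row ID or anti_join is needed, and a 5-row session contributes 3 rows to the training set instead of 2.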
Upvotes: 1