Reputation: 7742
Here is my dataframe:
df <- data.frame(col_1 = c('11/13/2007', '11/17/2007', '11/19/2007', '11/25/2007', '11/28/2007'),
col_2 = c('A', 'B', 'C', 'D', 'E'))
I would like to add column, which would group elements using time difference of dates in col_1
. For example first, second and third rows will be in group 1, since dates differ only by less than 5 days (between each consecutive dates) and row four and five will be in group 2. We will get two groups since two consecutive dates '11/19/2007' and '11/25/2007' differ by more than 5 days.
I can compute day difference between dates, but now sure how to create grouping. I would prefer solution with dplyr
, but any piece of advice is appreciated.
Upvotes: 3
Views: 1805
Reputation: 2806
You can do this really quickly with base R.
library(data.table)
df <- data.frame(col_1 = c('11/13/2007', '11/17/2007', '11/19/2007', '11/25/2007', '11/28/2007'),
col_2 = c('A', 'B', 'C', 'D', 'E'))
df$col_1 = as.Date(df$col_1, format = "%m/%d/%Y")
df$group = rleid(cumsum(c(0, diff.Date(df$col_1) > 5)))
> df
col_1 col_2 group
1 2007-11-13 A 1
2 2007-11-17 B 1
3 2007-11-19 C 1
4 2007-11-25 D 2
5 2007-11-28 E 2
Upvotes: 1
Reputation: 1345
This approach creates a final product with some redundant info, so there is definitely a more efficient way to go about it, but this seems to satisfy your dilemma:
## generate data
df <- data.frame(col_1 = c('11/13/2007',
'11/17/2007',
'11/19/2007',
'11/25/2007',
'11/28/2007'),
col_2 = c('A', 'B', 'C', 'D', 'E'))
## convert date to date class
df$col_1 <- as.Date(as.character(df$col_1), format = "%m/%d/%Y")
## define function for difftime variable
foo <- function(d1, d) sapply(d, function(x) difftime(d1, x))
## apply function to each observation and convert to data frame
dfdat <- structure(
data.frame(sapply(df$col_1, foo, df$col_1)),
names = as.character(df$col_1))
## combine with original data
df <- cbind(df, dfdat)
## use tidyr package to make long form
df <- tidyr::gather(df, referent, difftime, -col_1, -col_2)
## load dplyr
library(dplyr)
## use dplyr to mutate and group
df %>%
dplyr::mutate(referent = as.Date(referent)) %>%
dplyr::group_by(difftime)
Source: local data frame [25 x 4] Groups: difftime [19]
col_1 col_2 referent difftime <date> <fctr> <date> <dbl> 1 2007-11-13 A 2007-11-13 0 2 2007-11-17 B 2007-11-13 -4 3 2007-11-19 C 2007-11-13 -6 4 2007-11-25 D 2007-11-13 -12 5 2007-11-28 E 2007-11-13 -15 6 2007-11-13 A 2007-11-17 4 7 2007-11-17 B 2007-11-17 0 8 2007-11-19 C 2007-11-17 -2 9 2007-11-25 D 2007-11-17 -8 10 2007-11-28 E 2007-11-17 -11 # ... with 15 more rows
Upvotes: 1
Reputation: 14370
I think you could create the groups without have to do anything particularly fancy.
First we clean col_1
then get the groups. Note I create lag_time_diff
to help with readability but you can choose to put it directly in the cumsum
if you want.
df$col_1 <- as.POSIXct(df$col_1, format = "%m/%d/%Y")
lag_time_diff <- difftime(df$col_1, lag(df$col_1, default = df$col_1[1]), units = "days")
df$group <- cumsum(ifelse(lag_time_diff>5,1,0))
df
# col_1 col_2 group
#1 2007-11-13 A 0
#2 2007-11-17 B 0
#3 2007-11-19 C 0
#4 2007-11-25 D 1
#5 2007-11-28 E 1
All this does is check if the lagged difference in times is >5, if it is it indexes by 1 otherwise it keeps the same value.
Upvotes: 3