Vaibhav Jha
Vaibhav Jha

Reputation: 73

Cohort data transformation in R

I have data frame df with customer names, date of joining, expiry and cohort.

names         dj        exp  cohort
  (fctr)     (date)     (date)   (chr)
1    Tom 2011-05-01 2011-06-22 2011-05
2  David 2011-06-01 2011-07-19 2011-06
3   Jack 2011-05-03 2012-01-03 2011-05
>

names<-c("Tom","David","Jack")
dj<-as.Date(c("2011-05-01","2011-06-01","2011-05-03"))
exp<-as.Date(c("2011-06-22","2011-07-19","2012-01-03"))
df<-data.frame(names,dj,exp)
df$cohort<-format(df$dj,"%Y-%m")
tbl_df(df)

And a vector DateColumns <- seq.Date(as.Date("2011/05/01"), as.Date("2015/12/1"), by = "1 month") with calendar dates from 1st of May 2011 to 1st Dec 2015

From which I would like to check whether the customer is active in a particular calendar month or not. Active is defined as a customer's exp>DateColumns & dj<=DateColumns

Output 1 (Correct):

names dj  exp   cohort               2011-05-01 2011-06-01 2011-07-01 2011-08-01 .

Tom 2011-05-01 2011-06-22 2011-05    TRUE       TRUE       FALSE      FALSE
David 2011-06-01 2011-07-19 2011-06  FALSE      TRUE       TRUE       FALSE
Jack 2011-05-03 2012-01-03 2011-05   TRUE       TRUE       TRUE       TRUE  ....

Below is the code I wrote, unfortunately this not able to compare both expiry date and dj with the calendar date in date columns. For example David should be false in X1. So, how do I do that?

Incorrect output

  names<-c("Tom","David","Jack")
    dj<-as.Date(c("2011-05-01","2011-06-01","2011-05-03"))
    exp<-as.Date(c("2011-06-22","2011-07-19","2012-01-03"))
    df<-data.frame(names,dj,exp)
    df$cohort<-format(df$dj,"%Y-%m")


    DateColumns <- seq.Date(as.Date("2011/05/01"), as.Date("2015/12/1"), by = "1 month")

DateColumnvalues <- t(sapply(df$exp, function(x) x > DateColumns))
df2 <- data.frame(df,DateColumnvalues)
tbl_df(df2)

output:
names         dj        exp  cohort    X1    X2    X3    X4    X5    X6
  (fctr)     (date)     (date)   (chr) (lgl) (lgl) (lgl) (lgl) (lgl) (lgl)
1    Tom 2011-05-01 2011-06-22 2011-05  TRUE  TRUE FALSE FALSE FALSE FALSE
2  David 2011-06-01 2011-07-19 2011-06  **TRUE**  TRUE  TRUE FALSE FALSE FALSE
3   Jack 2011-05-03 2012-01-03 2011-05  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
Variables not shown: X7 (lgl), X8 (lgl), X9 (lgl), X10 (lgl), X11 (lgl),
  X12 (lgl), X13 (lgl), X14 (lgl), X15 (lgl), X16 (lgl), X17 (lgl), X18
  (lgl), X19 (lgl), X20 (lgl), X21 (lgl), X22 (lgl), X23 (lgl), X24 (lgl),
  X25 (lgl), X26 (lgl), X27 (lgl), X28 (lgl), X29 (lgl), X30 (lgl), X31
  (lgl), X32 (lgl), X33 (lgl), X34 (lgl), X35 (lgl), X36 (lgl), X37 (lgl),
  X38 (lgl), X39 (lgl), X40 (lgl), X41 (lgl), X42 (lgl), X43 (lgl), X44
  (lgl), X45 (lgl), X46 (lgl), X47 (lgl), X48 (lgl), X49 (lgl), X50 (lgl),
  X51 (lgl), X52 (lgl), X53 (lgl), X54 (lgl), X55 (lgl), X56 (lgl)
> 

Note: X1 is "2011-05-01" and X2="2011-06-01" so on denoting calendar months

Second, I want to transform this data "relatively" by aggregating based on month of joining and cohort. For example, if a customer "Dick" joined in 2015-Jan and will expire in Dec 2015 his M0 should be set to true, but the M0 considered should Jan 2015 and not calendar month.

Names dj exp cohort                     M0 M1 M2 M3  till M55
Dick  2015-01-11 2015-12-10 2015-01     T  T  T  T
Tom   2011-05-01 2011-06-22 2011-05     T  T  F  F
David 2011-06-01 2011-07-19 2011-06     T  T  F  F

Upvotes: 1

Views: 201

Answers (1)

lukeA
lukeA

Reputation: 54237

As an answer to the first question, you could do

library(data.table)
library(lubridate)
dt <- data.table(df, key=c("dj", "exp"))
dates <- setDT(transform(data.frame(start = seq.Date(as.Date("2011-05-01"), as.Date("2011-08-01"), "1 month")), 
                                    end = start + months(1) - 1), 
               key = c("start", "end"))
dcast(foverlaps(dt, dates)[, val:=TRUE], names+dj+exp+cohort~start, value.var="val", fill=FALSE)
#    names         dj        exp  cohort 2011-05-01 2011-06-01 2011-07-01 2011-08-01
# 1: David 2011-06-01 2011-07-19 2011-06      FALSE       TRUE       TRUE      FALSE
# 2:  Jack 2011-05-03 2012-01-03 2011-05       TRUE       TRUE       TRUE       TRUE
# 3:   Tom 2011-05-01 2011-06-22 2011-05       TRUE       TRUE      FALSE      FALSE

Concerning the 2nd question, If I understood it correctly, I'd go with

lst <- apply(df[2:3], 1, function(x) { x <- as.Date(x); as.logical(seq_along(seq(x[1], x[2], by="month")))  }) 
n <- max(lengths(lst)) 
res <- cbind(df, do.call(rbind, lapply(lst, function(x) `length<-`(x, n) ))) 
res[is.na(res)] <- FALSE; res
#   names         dj        exp  cohort    1    2     3     4     5     6     7     8     9
# 1   Tom 2011-05-01 2011-06-22 2011-05 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# 2 David 2011-06-01 2011-07-19 2011-06 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# 3  Jack 2011-05-03 2012-01-03 2011-05 TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Upvotes: 1

Related Questions