iskandarblue
iskandarblue

Reputation: 7526

cohort analysis in R

I have a dataset test of user ids and the quarter in which user ids made a transaction:

> test
   id quarter
1   7      Q1
2   7      Q1
3   5      Q1
4   8      Q1
5   3      Q1
6   6      Q1
7  10      Q1
8   3      Q2
9  10      Q2
10  8      Q2
11  2      Q2
12  7      Q2
13  6      Q2
14  6      Q3
15  9      Q3
16  6      Q3
17  4      Q3
18  9      Q3
19  2      Q3
20  5      Q4
21  8      Q4
22 10      Q4
23  7      Q4
24  1      Q4
25  8      Q4

My objective is to find the number of unique users that were active in Quarter 1 and, of those users, get the number of users that made transactions in subsequent quarters.

The output from the above sample should be:

> output
  quarter unique_cohort_ids
1      Q1                 6
2      Q2                 5
3      Q3                 1
4      Q4                 4

The following code works, but it is verbose and repetitive, and is not ideal for a larger dataset with dozens of quarters. Any suggestions on how to improve it would be appreciated:

subsets <-split(test, test$quarter, drop=TRUE)

for (i in 1:length(subsets)) {
  assign(paste0("Q", i), as.data.frame(subsets[[i]]))
}

Q1_ids <- sum(uniqueQ1$id)
Q2_ids <- sum(unique(Q1$id) %in% unique(Q2$id))
Q3_ids <- sum(unique(Q1$id) %in% unique(Q3$id))
Q4_ids <- sum(unique(Q1$id) %in% unique(Q4$id)) 

Cohort analysis also includes counting the number of new users per quarter. As opposed to manually counting the unique user ids that appear per quarter that had not appeared in any previous quarters, I am seeking a programmatic solution. In this case, the output would be the following list:

> mylist
[[1]]
[1]  7  5  8  3  6 10

[[2]]
[1] 2

[[3]]
[1] 9 4

[[4]]
[1] 1

Any suggestions would be appreciated

Here is the sample data:

> dput(test)
structure(list(id = c(7, 7, 5, 8, 3, 6, 10, 3, 10, 8, 2, 7, 6, 
6, 9, 6, 4, 9, 2, 5, 8, 10, 7, 1, 8), quarter = c("Q1", "Q1", 
"Q1", "Q1", "Q1", "Q1", "Q1", "Q2", "Q2", "Q2", "Q2", "Q2", "Q2", 
"Q3", "Q3", "Q3", "Q3", "Q3", "Q3", "Q4", "Q4", "Q4", "Q4", "Q4", 
"Q4")), .Names = c("id", "quarter"), row.names = c(NA, -25L), class = "data.frame")

Upvotes: 2

Views: 656

Answers (3)

emilliman5
emilliman5

Reputation: 5966

Here's an example using data.table

library(data.table)
setDT(test)[, sum(unique(id) %in% test[quarter=="Q1", id]), by = quarter]

#   quarter V1
#1:      Q1  6
#2:      Q2  5
#3:      Q3  1
#4:      Q4  4

And for the second part of the analysis:

split(test$id[!duplicated(test$id)], test$quarter[!duplicated(test$id)])

#$Q1
#[1]  7  5  8  3  6 10

#$Q2
#[1] 2

#$Q3
#[1] 9 4

#$Q4
#[1] 1

Update based on new requirements:

This seems a bit clunky to me but was the easiest way to keep track of the ids seen as you progress through the quarters.

quarts <- sort(unique(test$quarter))
test$occur <- 1
mat <- dcast.data.table(test, id ~ quarter, value.var = "occur", fun.aggregate = sum)

res <- mat[Q1 >0, lapply(.SD, function(x) sum(x>0)), .SDcols=colnames(mat)[-1]] ##initalize the results with "Q1"
cumMat <- mat$Q1 ##this will keep track of the ids that were seen in previous quarters  

for(i in quarts[-1]){                                  ##foreach quarter (except ("Q1") we will count the number of unique ids that are in that quarter and not in any previous quarters.
  res <- rbind(res, mat[cumMat == 0 & get(i) > 0, lapply(.SD, function(x) sum(x>0)), .SDcols=colnames(mat)[-1]])
  cumMat <- rowSums(cbind(cumMat, mat[, i, with = F])) ##update the ids seen
}
t(res) 

#   [,1] [,2] [,3] [,4]
#Q1    6    0    0    0
#Q2    5    1    0    0
#Q3    1    1    2    0
#Q4    4    0    0    1

Upvotes: 2

Hao
Hao

Reputation: 7856

How about this approach?

library(tidyverse)
test %>% 
distinct() %>% 
mutate(value = T) %>% 
spread(quarter, value) %>% 
filter(!is.na(Q1)) %>%
select(-id) %>%
colSums(na.rm = T)
# Q1 Q2 Q3 Q4 
# 6  5  1  4 

Upvotes: 1

Nathan Werth
Nathan Werth

Reputation: 5273

For the unique cohort counts per quarter:

q1_ids <- test[test$quarter == 'Q1', 'id']
test_from_q1 <- test[test$id %in% q1_ids, ]
unique_from_q1 <- unique(test_from_q1)
quarter_counts <- table(unique_from_q1$quarter)
output <- as.data.frame(quarter_counts)
names(output) <- c('quarter', 'unique_cohort_ids')

For the second part, would a data.frame be acceptable?

unique_ids <- unique(test$id)
first_appearance <- data.frame(
  id = unique_ids,
  quarter = test$quarter[match(unique_ids, test$id)]
)

If not, then a tapply can list-ify it:

tapply(
    first_appearance$id,
    first_appearance$quarter,
    'identity',
    simplify = FALSE
)

Upvotes: 0

Related Questions