Reputation: 7526
I have a dataset test
of user ids and the quarter in which user ids made a transaction:
> test
id quarter
1 7 Q1
2 7 Q1
3 5 Q1
4 8 Q1
5 3 Q1
6 6 Q1
7 10 Q1
8 3 Q2
9 10 Q2
10 8 Q2
11 2 Q2
12 7 Q2
13 6 Q2
14 6 Q3
15 9 Q3
16 6 Q3
17 4 Q3
18 9 Q3
19 2 Q3
20 5 Q4
21 8 Q4
22 10 Q4
23 7 Q4
24 1 Q4
25 8 Q4
My objective is to find the number of unique users that were active in Quarter 1 and, of those users, get the number of users that made transactions in subsequent quarters.
The output from the above sample should be:
> output
quarter unique_cohort_ids
1 Q1 6
2 Q2 5
3 Q3 1
4 Q4 4
The following code works, but it is verbose and repetitive, and is not ideal for a larger dataset with dozens of quarters. Any suggestions on how to improve it would be appreciated:
subsets <-split(test, test$quarter, drop=TRUE)
for (i in 1:length(subsets)) {
assign(paste0("Q", i), as.data.frame(subsets[[i]]))
}
Q1_ids <- sum(uniqueQ1$id)
Q2_ids <- sum(unique(Q1$id) %in% unique(Q2$id))
Q3_ids <- sum(unique(Q1$id) %in% unique(Q3$id))
Q4_ids <- sum(unique(Q1$id) %in% unique(Q4$id))
Cohort analysis also includes counting the number of new users per quarter. As opposed to manually counting the unique user ids that appear per quarter that had not appeared in any previous quarters, I am seeking a programmatic solution. In this case, the output would be the following list:
> mylist
[[1]]
[1] 7 5 8 3 6 10
[[2]]
[1] 2
[[3]]
[1] 9 4
[[4]]
[1] 1
Any suggestions would be appreciated
Here is the sample data:
> dput(test)
structure(list(id = c(7, 7, 5, 8, 3, 6, 10, 3, 10, 8, 2, 7, 6,
6, 9, 6, 4, 9, 2, 5, 8, 10, 7, 1, 8), quarter = c("Q1", "Q1",
"Q1", "Q1", "Q1", "Q1", "Q1", "Q2", "Q2", "Q2", "Q2", "Q2", "Q2",
"Q3", "Q3", "Q3", "Q3", "Q3", "Q3", "Q4", "Q4", "Q4", "Q4", "Q4",
"Q4")), .Names = c("id", "quarter"), row.names = c(NA, -25L), class = "data.frame")
Upvotes: 2
Views: 656
Reputation: 5966
Here's an example using data.table
library(data.table)
setDT(test)[, sum(unique(id) %in% test[quarter=="Q1", id]), by = quarter]
# quarter V1
#1: Q1 6
#2: Q2 5
#3: Q3 1
#4: Q4 4
And for the second part of the analysis:
split(test$id[!duplicated(test$id)], test$quarter[!duplicated(test$id)])
#$Q1
#[1] 7 5 8 3 6 10
#$Q2
#[1] 2
#$Q3
#[1] 9 4
#$Q4
#[1] 1
This seems a bit clunky to me but was the easiest way to keep track of the ids seen as you progress through the quarters.
quarts <- sort(unique(test$quarter))
test$occur <- 1
mat <- dcast.data.table(test, id ~ quarter, value.var = "occur", fun.aggregate = sum)
res <- mat[Q1 >0, lapply(.SD, function(x) sum(x>0)), .SDcols=colnames(mat)[-1]] ##initalize the results with "Q1"
cumMat <- mat$Q1 ##this will keep track of the ids that were seen in previous quarters
for(i in quarts[-1]){ ##foreach quarter (except ("Q1") we will count the number of unique ids that are in that quarter and not in any previous quarters.
res <- rbind(res, mat[cumMat == 0 & get(i) > 0, lapply(.SD, function(x) sum(x>0)), .SDcols=colnames(mat)[-1]])
cumMat <- rowSums(cbind(cumMat, mat[, i, with = F])) ##update the ids seen
}
t(res)
# [,1] [,2] [,3] [,4]
#Q1 6 0 0 0
#Q2 5 1 0 0
#Q3 1 1 2 0
#Q4 4 0 0 1
Upvotes: 2
Reputation: 7856
How about this approach?
library(tidyverse)
test %>%
distinct() %>%
mutate(value = T) %>%
spread(quarter, value) %>%
filter(!is.na(Q1)) %>%
select(-id) %>%
colSums(na.rm = T)
# Q1 Q2 Q3 Q4
# 6 5 1 4
Upvotes: 1
Reputation: 5273
For the unique cohort counts per quarter:
q1_ids <- test[test$quarter == 'Q1', 'id']
test_from_q1 <- test[test$id %in% q1_ids, ]
unique_from_q1 <- unique(test_from_q1)
quarter_counts <- table(unique_from_q1$quarter)
output <- as.data.frame(quarter_counts)
names(output) <- c('quarter', 'unique_cohort_ids')
For the second part, would a data.frame be acceptable?
unique_ids <- unique(test$id)
first_appearance <- data.frame(
id = unique_ids,
quarter = test$quarter[match(unique_ids, test$id)]
)
If not, then a tapply can list-ify it:
tapply(
first_appearance$id,
first_appearance$quarter,
'identity',
simplify = FALSE
)
Upvotes: 0