Reputation: 4243
I have a dataset that looks like this:
User<- c("User1", "User1","User1", "User1","User1", "User1","User1", "User2","User2","User2","User2","User2","User2","User2")
Touchpoints <- c("A", "B", "C", "F", "D", "E", "H","A", "B", "K", "D", "E", "F", "M")
Conversion <- c(0,0,0,1,0,0,1,0,0,1,1,0,0,1)
Frequency<-c(1,2,3,0,4,5,0,1,2,0,0,3,4,5)
df<-data.frame(User, Touchpoints, Conversion, Frequency)
df$Exponential<-ifelse(df$Frequency>0, exp(df$Frequency), 0)
df
User Touchpoints Conversion Frequency Exponential
1 User1 A 0 1 2.718282
2 User1 B 0 2 7.389056
3 User1 C 0 3 20.085537
4 User1 F 1 0 0.000000
5 User1 D 0 4 54.598150
6 User1 E 0 5 148.413159
7 User1 H 1 0 0.000000
8 User2 A 0 1 2.718282
9 User2 B 0 2 7.389056
10 User2 K 1 0 0.000000
11 User2 D 1 0 0.000000
12 User2 E 0 3 20.085537
13 User2 F 0 4 54.598150
14 User2 M 1 5 148.413159
Here is what I am trying to do:
I want to take the percentage of the Exponential
value represented by _Conv
from the sum of the Exponential
Column by User
up to that Conversion
value. Here is an example:
User Touchpoints Conversion Frequency Exponential Sum of Exp 1st_Conv Sum_Exp_for_Conv2 2nd_Conv
1 User1 A 0 1 2.718282 30.192 0.0900 233.204 0.0116
2 User1 B 0 2 7.389056 30.192 0.2447 233.204 0.0317
3 User1 C 0 3 20.085537 30.192 0.6652 233.204 0.0861
4 User1 F 1 0 0.000000 0 0.0000 233.204 0
5 User1 D 0 4 54.598150 0 0.0000 233.204 0.2341
6 User1 E 0 5 148.413159 0 0.0000 233.204 0.6364
7 User1 H 1 0 0.000000 0 0.0000 0 0
8 User2 A 0 1 2.718282 10.107 0.2689 10.107 0.2689
9 User2 B 0 2 7.389056 10.107 0.7311 10.107 0.7311
10 User2 K 1 0 0.000000 0 0.0000 0 0
11 User2 D 1 0 0.000000 0 0.0000 0 0
12 User2 E 0 3 20.085537 0 0.0000 0 0
13 User2 F 0 4 54.598150 0 0.0000 0 0
14 User2 M 0 5 148.413159 0 0.0000 0 0
There will be cases where there will be over 100 conversions per user and doing it this way would create thousands of columns, seems like it is not scalable.
My ultimate output is to sum all of the _Conv
into one final column called Final_Conv
. For this example, the final output would look like this:
User Touchpoints Conversion Frequency Final_Conv
1 User1 A 0 1 0.1017
2 User1 B 0 2 0.2764
3 User1 C 0 3 0.7514
4 User1 F 1 0 0
5 User1 D 0 4 0.2341
6 User1 E 0 5 0.6364
7 User1 H 1 0 0
8 User2 A 0 1 0.5379
9 User2 B 0 2 1.4621
10 User2 K 1 0 0
11 User2 D 1 0 0
12 User2 E 0 3 0
13 User2 F 0 4 0
14 User2 M 0 5 0
Any help would be great, thanks!
Upvotes: 1
Views: 56
Reputation: 18661
Probably not the simplest code, but we can do something like the following:
library(dplyr)
library(tidyr)
df %>%
group_by(User) %>%
mutate(row_id = row_number(),
conv_id = cumsum(Conversion),
exp_cumsum = cumsum(Exponential)) %>%
group_by(conv_id, add = TRUE) %>%
mutate(sum_of_exp = ifelse(n()==1, NA, last(exp_cumsum))) %>%
spread(conv_id, sum_of_exp, sep = "_") %>%
arrange(User, row_id) %>%
fill(!!!vars(starts_with("conv_id")), .direction = "up") %>%
mutate_at(vars(starts_with("conv_id")), funs(Exponential/.)) %>%
ungroup() %>%
mutate(Final_Conv = rowSums(.[-(1:7)], na.rm = TRUE)) %>%
select(1:4, Final_Conv)
Notes:
I first created a cumulative sum of Conversion
and Exponential
, added conv_id
as an additional grouping variable, and replaced all values in each User
+ conv_id
combination with the last value of exp_cumsum
. Then, spread the conv_id
and sum_of_exp
columns and filled each conv_id_
column up. Finally, used mutate_at
to divide Exponential
by each of the conv_id_
columns and created Final_Conv
by summing all resultant conv_id_
columns with rowSums
.
This solution should work for an arbitrary number of Conversion
s for each User
.
Result:
# A tibble: 14 x 5
User Touchpoints Conversion Frequency Final_Conv
<fctr> <fctr> <dbl> <dbl> <dbl>
1 User1 A 0 1 0.1016868
2 User1 B 0 2 0.2764134
3 User1 C 0 3 0.7513695
4 User1 F 1 0 0.0000000
5 User1 D 0 4 0.2341217
6 User1 E 0 5 0.6364086
7 User1 H 1 0 0.0000000
8 User2 A 0 1 0.3010000
9 User2 B 0 2 0.8182029
10 User2 K 1 0 0.0000000
11 User2 D 1 0 0.0000000
12 User2 E 0 3 0.2368828
13 User2 F 0 4 0.6439143
14 User2 M 1 5 0.0000000
Upvotes: 1