Reputation: 466
I'm trying to find the total count of how many users in df1 meet conditions specified in df2, but keep getting an error message.
df1 looks like this:
id step1 step2
1 session_start NA
2 session_start NA
3 session_start sign_up
4 session_start sign_up
5 session_start sign_up
6 sign_up session_start
df1 <- Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 6 obs. of 3 variables:
$ id : chr "1" "2" "3" "4" ...
$ step1: chr "session_start" "session_start" "session_start" "session_start" ...
$ step2: chr NA NA "sign_up" "sign_up" ..
df2 looks like this:
step1 step2 count
session_start sign_up 0
sign_up in_screen 0
in_screen click_banner 0
session_stop session_stop 0
df2 <- structure(c("session_start", "sign_up", "0", "sign_up",
"in_screen", "0", "in_screen", "click_banner", "0", "session_stop",
"session_stop", "0", .Dim = c(3L, 4L), .Dimnames = list(c("step1", "step2",
"count"), NULL))
In the column df2$count
, I would like to show how many (sum of) users completed df2$step1
and df2$step2
in that order. In the sample code above, the first row in df2$count
would output 3, as 3 users in df1 completed session_start
as df1$step1
and sign_up
as df1step2
.
When I try doing this manually using this code, everything works fine:
count <- sum(df1$step1 == "session_start" & df1$step2 == "sign_up", na.rm = TRUE)
However, when I replace "session_start" and "sign_up" with dynamic values, I get the error that "Error in test8$step1 : $ operator is invalid for atomic vectors":
df2$count <- sum(df1$step1 == df2$step1 & df1$step2 == df2$step2, na.rm = TRUE)
I tried replacing "$" with "[]", but still get "Error: Columns session_start
, sign_up
, in_screen
, click_banner
, session_stop
not found":
df2[count] <- sum(df1[step1] == df2[step1] & df1[step2] == df2[step2], na.rm = TRUE)
I'd love to be able to add the extra column to the data as shown below. Are you able to help? Many thanks in advance if so!
step1 step2 count
session_start sign_up 3
sign_up in_screen 0
in_screen click_banner 0
session_stop session_stop 0
Upvotes: 3
Views: 403
Reputation: 76432
Here is a tidyverse
solution.
library(tidyverse)
df2 %>%
group_by(step1, step2) %>%
mutate(count = sum(step1 == df1$step1 & step2 == df1$step2, na.rm = TRUE))
## A tibble: 4 x 3
## Groups: step1, step2 [4]
# step1 step2 count
# <chr> <chr> <int>
#1 session_start sign_up 3
#2 sign_up in_screen 0
#3 in_screen click_banner 0
#4 session_stop session_stop 0
Note that instead of mutate
you can also use summarise
but the order of the output rows will be different.
Upvotes: 3
Reputation: 388982
You could use mapply
and calculate number of step1
and step2
values completed in df1
.
df2$count <- mapply(function(x, y)
sum(df1$step1 == x & df1$step2 == y, na.rm = TRUE), df2$step1, df2$step2)
df2
# step1 step2 count
#1 session_start sign_up 3
#2 sign_up in_screen 0
#3 in_screen click_banner 0
#4 session_stop session_stop 0
data
df1 <- structure(list(id = c("1", "2", "3", "4", "5", "6"),
step1 = c("session_start", "session_start", "session_start",
"session_start", "session_start",
"sign_up"), step2 = c(NA, NA, "sign_up", "sign_up", "sign_up",
"session_start")), .Names = c("id", "step1", "step2"), row.names = c(NA,
-6L), class = "data.frame")
df2 <- structure(list(step1 = c("session_start", "sign_up", "in_screen",
"session_stop"), step2 = c("sign_up", "in_screen", "click_banner",
"session_stop")), .Names = c("step1", "step2"), row.names = c(NA,
-4L), class = "data.frame")
Upvotes: 3