FPL
FPL

Reputation: 466

Perform sum when multiple conditions required are in another table?

I'm trying to find the total count of how many users in df1 meet conditions specified in df2, but keep getting an error message.

df1 looks like this:

    id  step1          step2
    1   session_start  NA
    2   session_start  NA
    3   session_start  sign_up
    4   session_start  sign_up
    5   session_start  sign_up
    6   sign_up        session_start

df1 <- Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':  6 obs. of  3 variables:
    $ id   : chr  "1" "2" "3" "4" ...
    $ step1: chr  "session_start" "session_start" "session_start" "session_start" ...
    $ step2: chr  NA NA "sign_up" "sign_up" ..

df2 looks like this:

    step1          step2         count
    session_start  sign_up       0
    sign_up        in_screen     0
    in_screen      click_banner  0
    session_stop   session_stop  0

df2 <- structure(c("session_start", "sign_up", "0", "sign_up", 
"in_screen", "0", "in_screen", "click_banner", "0", "session_stop", 
"session_stop", "0", .Dim = c(3L, 4L), .Dimnames = list(c("step1", "step2", 
"count"), NULL))

In the column df2$count, I would like to show how many (sum of) users completed df2$step1 and df2$step2 in that order. In the sample code above, the first row in df2$count would output 3, as 3 users in df1 completed session_start as df1$step1 and sign_up as df1step2.

Previous Attempts

When I try doing this manually using this code, everything works fine:

count <- sum(df1$step1 == "session_start" & df1$step2 == "sign_up", na.rm = TRUE)

However, when I replace "session_start" and "sign_up" with dynamic values, I get the error that "Error in test8$step1 : $ operator is invalid for atomic vectors":

df2$count <- sum(df1$step1 == df2$step1 & df1$step2 == df2$step2, na.rm = TRUE)

I tried replacing "$" with "[]", but still get "Error: Columns session_start, sign_up, in_screen, click_banner, session_stop not found":

df2[count] <- sum(df1[step1] == df2[step1] & df1[step2] == df2[step2], na.rm = TRUE)

Ideal response:

I'd love to be able to add the extra column to the data as shown below. Are you able to help? Many thanks in advance if so!

    step1          step2         count
    session_start  sign_up       3
    sign_up        in_screen     0
    in_screen      click_banner  0
    session_stop   session_stop  0

Upvotes: 3

Views: 403

Answers (2)

Rui Barradas
Rui Barradas

Reputation: 76432

Here is a tidyverse solution.

library(tidyverse)

df2 %>%
  group_by(step1, step2) %>%
  mutate(count = sum(step1 == df1$step1 & step2 == df1$step2, na.rm = TRUE))
## A tibble: 4 x 3
## Groups:   step1, step2 [4]
#  step1         step2        count
#  <chr>         <chr>        <int>
#1 session_start sign_up          3
#2 sign_up       in_screen        0
#3 in_screen     click_banner     0
#4 session_stop  session_stop     0

Note that instead of mutate you can also use summarise but the order of the output rows will be different.

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 388982

You could use mapply and calculate number of step1 and step2 values completed in df1.

df2$count <- mapply(function(x, y) 
    sum(df1$step1 == x & df1$step2 == y, na.rm = TRUE), df2$step1, df2$step2)


df2
#          step1        step2 count
#1 session_start      sign_up     3
#2       sign_up    in_screen     0
#3     in_screen click_banner     0
#4  session_stop session_stop     0

data

df1 <- structure(list(id = c("1", "2", "3", "4", "5", "6"), 
step1 = c("session_start", "session_start", "session_start", 
 "session_start", "session_start", 
 "sign_up"), step2 = c(NA, NA, "sign_up", "sign_up", "sign_up", 
"session_start")), .Names = c("id", "step1", "step2"), row.names = c(NA, 
-6L), class = "data.frame")

df2 <- structure(list(step1 = c("session_start", "sign_up", "in_screen", 
"session_stop"), step2 = c("sign_up", "in_screen", "click_banner", 
"session_stop")), .Names = c("step1", "step2"), row.names = c(NA, 
-4L), class = "data.frame")

Upvotes: 3

Related Questions