Oliver

Reputation: 39

Create new variable that summarizes observation given a certain condition

Hello, I'm new to R and I don't understand why the following approach does not work. I have a data frame df1 that looks something like this:

  view  duration_hours  date 
1  a        5          2021-03-29            
2  a        7          2021-03-29           
3  a        3          2021-03-30            
4  b        2          2021-03-30
5  b        5          2021-03-30
6  c        9          2021-03-30      
7  c        2          2021-03-31            
8  c        3          2021-04-01

I want a new data frame (df2) that sums the durations for each date, both over all views and split by individual view:

        date  duration_sum   a   b   c
1 2021-03-29            12  12   0   0
2 2021-03-30            19   3   7   9
3 2021-03-31             2   0   0   2
4 2021-04-01             3   0   0   3

First, I tried the following for just the overall duration. It worked as intended, creating the "duration_sum" variable with the summed durations for every date:

df2 <- df1 %>%
  group_by(date) %>%
  summarise(duration_sum = sum(duration_hours, na.rm = TRUE))

Then I tried to add the other variables by augmenting the code in the following way

df2<- df1 %>%
  group_by(date) %>%
  summarise(duration_sum = sum(duration_hours, na.rm = TRUE),
            a =sum(duration_hours[view=="a"], na.r = TRUE),
            b =sum(duration_hours[view=="b"], na.r = TRUE),
            c =sum(duration_hours[view=="c"], na.r = TRUE))

But that did not yield the right amounts. What am I doing wrong?

Upvotes: 1

Views: 33

Answers (1)

akrun

Reputation: 887048

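The argument is na.rm, not na.r. Because sum() takes its values through ..., any argument that comes after it (such as na.rm) must be named exactly; a name that does not match is treated as one more value to sum. The TRUE is therefore coerced to 1 (FALSE would be coerced to 0), which adds 1 to each total.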

e.g.

sum(c(1, 2), na.r = TRUE)
#[1] 4
sum(c(1, 2), na.rm = TRUE)
#[1] 3
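
The same point can be checked from the other direction. The short sketch below uses only base R: it compares the misspelled call with passing TRUE as an unnamed value, and shows that the misspelling also fails to drop missing values.

```r
# The misspelled name is not matched to na.rm, so TRUE is summed as a value.
with_typo    <- sum(c(1, 2), na.r = TRUE)
with_unnamed <- sum(c(1, 2), TRUE)
print(identical(with_typo, with_unnamed))

# With an NA present, na.r = TRUE does not remove it; na.rm = TRUE does.
typo_na    <- sum(c(1, NA), na.r = TRUE)
correct_na <- sum(c(1, NA), na.rm = TRUE)
print(is.na(typo_na))
print(correct_na)
```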

The OP's corrected code would be

library(dplyr)
df1 %>%
  group_by(date) %>%
  summarise(duration_sum = sum(duration_hours, na.rm = TRUE),
            a = sum(duration_hours[view == "a"], na.rm = TRUE),
            b = sum(duration_hours[view == "b"], na.rm = TRUE),
            c = sum(duration_hours[view == "c"], na.rm = TRUE))
# A tibble: 4 x 5
#  date       duration_sum     a     b     c
#* <chr>             <int> <int> <int> <int>
#1 2021-03-29           12    12     0     0
#2 2021-03-30           19     3     7     9
#3 2021-03-31            2     0     0     2
#4 2021-04-01            3     0     0     3

Another option is pivot_wider, which returns the per-view columns a, b and c for each date (it does not create duration_sum):

library(tidyr)
pivot_wider(df1, names_from = view, values_from = duration_hours,    
         values_fn = sum, values_fill = 0)
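
If the overall total is also wanted alongside the pivot_wider result, one way is to add it row-wise afterwards. This is a minimal sketch, assuming the df1 from this answer, a dplyr version that supports across() and the .after argument of mutate() (1.0.0 or later), and that every column other than date is a view column:

```r
library(dplyr)
library(tidyr)

# Same data as df1 in this answer, written with data.frame() for brevity.
df1 <- data.frame(
  view = c("a", "a", "a", "b", "b", "c", "c", "c"),
  duration_hours = c(5L, 7L, 3L, 2L, 5L, 9L, 2L, 3L),
  date = c("2021-03-29", "2021-03-29", "2021-03-30", "2021-03-30",
           "2021-03-30", "2021-03-30", "2021-03-31", "2021-04-01")
)

df2 <- df1 %>%
  # One column per view, summing durations within each date/view pair;
  # combinations that do not occur are filled with 0.
  pivot_wider(names_from = view, values_from = duration_hours,
              values_fn = sum, values_fill = 0) %>%
  # Row-wise total over every column except date (assumed to be view columns).
  mutate(duration_sum = rowSums(across(-date)), .after = date)

print(df2)
```

Unlike the summarise() version, this does not need one line per view, so it keeps working if further views appear in the data.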

data

df1 <- structure(list(view = c("a", "a", "a", "b", "b", "c", "c", "c"
), duration_hours = c(5L, 7L, 3L, 2L, 5L, 9L, 2L, 3L), date = c("2021-03-29", 
"2021-03-29", "2021-03-30", "2021-03-30", "2021-03-30", "2021-03-30", 
"2021-03-31", "2021-04-01")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))

Upvotes: 2
