Mr. T

Reputation: 35

Function has unexpected behaviour (mutate function in dplyr)

I expected the "n_rest" column to hold a (reversed) cumulative sum of "n_i", but I only get a copy of the "n_i" column. Uncommenting the "as.data.frame() %>%" line fixes it, but I don't like that workaround and would like to understand what I got wrong.

Thanks in advance!

library(dplyr)

t      <- c(42,57,63,98,104,105,132,132,132,133,133,133,139,140,161,180,180,195,195,233)
status <- c(1 ,1 ,1 ,1 ,0  ,1  ,1  ,1  ,1  ,1  ,1  ,1  ,1  ,1  ,1  ,1  ,1  ,1  ,1  ,  0)

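KMP <- function(time, status){

  # Use the function argument `time`, not the global vector `t`
  n_ges <- length(time)

  # One row per observation; n = 1 so that sum(n) counts rows per group
  df <- data.frame(t = time, status = status, n = 1)
  df <- df %>%  group_by(t, status) %>%
                summarise(n_i = sum(n)) %>%
                # as.data.frame() %>%
                mutate(n_rest = rev(cumsum(n_i)))

  df

}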

Upvotes: 0

Views: 48

Answers (2)

Spacedman

Reputation: 94307

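The mutate is still working on grouped data: summarise only removes the last grouping level, so the result is still grouped by t and cumsum runs separately within each group.

Passing the result through as.data.frame drops the grouping. Alternatively, reset the grouping by putting an empty group_by() in the pipe: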

> df %>% group_by(t,status) %>% summarise(n_i=sum(n)) %>% group_by() %>% mutate(n_rest=cumsum(n_i))
# A tibble: 14 x 4
       t status   n_i n_rest
   <dbl>  <dbl> <dbl>  <dbl>
 1    42      1     1      1
 2    57      1     1      2
 3    63      1     1      3
 4    98      1     1      4
 5   104      0     1      5
 6   105      1     1      6
 7   132      1     3      9
 8   133      1     3     12
 9   139      1     1     13
10   140      1     1     14
11   161      1     1     15
12   180      1     2     17
13   195      1     2     19
14   233      0     1     20
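
Another way to get the same ungrouped result is to ask summarise to drop all grouping itself. The following is only a sketch and assumes dplyr 1.0.0 or later, where summarise accepts a `.groups` argument; the name `res` is just illustrative.

```r
library(dplyr)

# Data from the question
t      <- c(42, 57, 63, 98, 104, 105, 132, 132, 132, 133, 133, 133,
            139, 140, 161, 180, 180, 195, 195, 233)
status <- c(1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
df <- data.frame(t = t, status = status, n = 1)

# .groups = "drop" (assumes dplyr >= 1.0.0) returns an ungrouped tibble,
# so cumsum() runs over all 14 rows instead of within each t
res <- df %>%
  group_by(t, status) %>%
  summarise(n_i = sum(n), .groups = "drop") %>%
  mutate(n_rest = cumsum(n_i))

group_vars(res)  # character(0): no grouping left
res
```

This should print the same n_rest column as the output above, without needing a separate group_by() or ungroup() step.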

Upvotes: 1

Ronak Shah

Reputation: 389275

That is because your dataframe is still grouped by t after summarise. You can see this if you check the output of

library(dplyr)
df %>%   group_by(t,status) %>%  summarise(n_i = sum(n))

# A tibble: 14 x 3
# Groups:   t [14]
#       t status   n_i
#   <dbl>  <dbl> <dbl>
# 1    42      1     1
# 2    57      1     1
# 3    63      1     1
# 4    98      1     1
# 5   104      0     1
# 6   105      1     1
# 7   132      1     3
# 8   133      1     3
# 9   139      1     1
#10   140      1     1
#11   161      1     1
#12   180      1     2
#13   195      1     2
#14   233      0     1

From ?summarise

An object of the same class as .data. One grouping level will be dropped.

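As you are grouping by t and status, only the last grouping level (status) is dropped and the grouping by t is kept. The cumsum is therefore computed within each t group, and since every group here contains a single row, n_rest is just a copy of n_i.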
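
To check this on the data from the question, one can inspect the remaining grouping directly. This is a small sketch using dplyr's group_vars() and n_groups(); the names `grouped` and `still_grouped` are only illustrative.

```r
library(dplyr)

# Data from the question
t      <- c(42, 57, 63, 98, 104, 105, 132, 132, 132, 133, 133, 133,
            139, 140, 161, 180, 180, 195, 195, 233)
status <- c(1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
df <- data.frame(t = t, status = status, n = 1)

grouped <- df %>%
  group_by(t, status) %>%
  summarise(n_i = sum(n))

group_vars(grouped)  # "t": status was dropped, t is still a grouping variable
n_groups(grouped)    # 14 groups for 14 rows, i.e. one row per group

# With one row per group, cumsum() and rev() have nothing to accumulate or reverse
still_grouped <- grouped %>% mutate(n_rest = rev(cumsum(n_i)))
all(still_grouped$n_rest == still_grouped$n_i)  # TRUE
```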

You can remove the effect of the grouping by calling ungroup() after summarise:

df %>%  
  group_by(t,status) %>%
  summarise(n_i = sum(n)) %>%
  ungroup() %>%
  mutate(n_rest = rev(cumsum(n_i)))

The same effect was achieved by as.data.frame() in the OP's code, because converting to a plain data frame also discards the grouping.
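
As a rough check of that claim, the sketch below compares the grouping before and after as.data.frame(), using dplyr's is_grouped_df(). The names `summarised`, `plain` and `res` are only illustrative.

```r
library(dplyr)

# Data from the question
t      <- c(42, 57, 63, 98, 104, 105, 132, 132, 132, 133, 133, 133,
            139, 140, 161, 180, 180, 195, 195, 233)
status <- c(1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
df <- data.frame(t = t, status = status, n = 1)

summarised <- df %>%
  group_by(t, status) %>%
  summarise(n_i = sum(n))

is_grouped_df(summarised)  # TRUE: still grouped by t

# Converting to a base data.frame discards the grouping attributes
plain <- as.data.frame(summarised)
is_grouped_df(plain)       # FALSE
class(plain)               # "data.frame"

# mutate() now sees all 14 rows at once
res <- plain %>% mutate(n_rest = rev(cumsum(n_i)))
head(res$n_rest, 3)        # 20 19 17
```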

Upvotes: 0
