alan gao

Reputation: 51

sum() with conditions provides incorrect result in dplyr package

When I use sum() with a condition inside summarize(), it does not return the answer I expect.

Make a data frame x:

x = data.frame(flag = 1, uin = 1, val = 2)
x = rbind(x, data.frame(flag = 2, uin = 2, val = 3)) 

This is what x looks like:

  flag uin val
1    1   1   2
2    2   2   3

I want the sum of val, and also the sum of val where flag == 2, so I write

x %>% summarize(val = sum(val), val.2 = sum(val[flag == 2]))

and the result is:

  val val.2
1   5    NA

But I expect val.2 to be 3, not NA. Note that if I calculate the conditional sum first and the total sum second, I get the correct answer:

x %>% summarize(val.2 = sum(val[flag == 2]), val = sum(val))
  val.2 val
1     3   5

Moreover, if I calculate only the conditional sum, it works fine too:

x %>% summarize(val.2 = sum(val[flag == 2]))
  val.2
1     3

Upvotes: 2

Views: 731

Answers (1)

csgillespie

Reputation: 60472

Duplicate names are causing you problems. In this code

x %>% summarize(val = sum(val), val.2 = sum(val[flag == 2]))

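you have two val objects: one created by val = sum(val) and the other from the data frame x. summarize() evaluates its arguments in order, so once val = sum(val) has run, val no longer refers to the column of x but to the single value 5. Then you do

`val[flag == 2]`

which returns NA rather than 3: val is now the length-one vector 5, while flag == 2 is c(FALSE, TRUE), so the TRUE in the second position selects an element that does not exist. Summing that NA gives NA. The solution is not to use the name val twice: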

x %>% summarize(val_sum = sum(val), val.2 = sum(val[flag == 2]))
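The indexing step can be reproduced in base R, without dplyr. This is only a minimal sketch of the explanation above: the standalone vectors val and flag stand in for the columns of x, and val_total is a hypothetical name for the value that val = sum(val) produces.

```r
# Stand-ins for the columns of x
val  <- c(2, 3)
flag <- c(1, 2)

# With the original column, the logical index has the same length as val
val[flag == 2]                # 3

# After val = sum(val), the name refers to a single number
val_total <- sum(val)         # 5

# flag == 2 is c(FALSE, TRUE); the TRUE in position 2 points past the
# end of the length-one vector, so R returns NA
val_total[flag == 2]          # NA
sum(val_total[flag == 2])     # NA
```

Because the index is out of range rather than invalid, R returns NA silently instead of raising an error, which is why the summarize() call produces NA with no warning.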

Upvotes: 4
