Javier Fajardo

Reputation: 797

Sum NA cases in dplyr's summarise

I can't find what I am doing wrong when summarising a column that contains both values and NA. I have read in many places that you can count cases in summarise with sum(), and that NA cases can be counted with sum(is.na(variable)).

Actually, I can reproduce that behaviour with a test tibble:

library(dplyr)

df <- tibble(x = c(rep("a",5), rep("b",5)), y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

df %>%
  group_by(x) %>% 
  summarise(one = sum(y, na.rm = T),
            na = sum(is.na(y)))

And this is the expected result:

# A tibble: 2 x 3
      x   one    na
  <chr> <dbl> <int>
1     a     2     3
2     b     3     2

For some reason, I cannot reproduce the result with my data:

mydata <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Amphibians", 
"Birds", "Mammals", "Reptiles", "Plants"), class = "factor"), 
    Scenario = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
    1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Present", 
    "RCP 4.5", "RCP 8.5"), class = "factor"), year = c(1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940), random = c("obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs"), 
    species = c("Allobates fratisenescus", "Allobates fratisenescus", 
    "Allobates fratisenescus", "Allobates juanii", "Allobates juanii", 
    "Allobates juanii", "Allobates kingsburyi", "Allobates kingsburyi", 
    "Allobates kingsburyi", "Adelophryne adiastola", "Adelophryne adiastola", 
    "Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa", 
    "Adelophryne gutturosa", "Adelphobates quinquevittatus", 
    "Adelphobates quinquevittatus", "Adelphobates quinquevittatus"
    ), Endemic = c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA)), row.names = c(NA, -18L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "species", indices = list(
    9:11, 12:14, 15:17, 0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 
3L, 3L, 3L, 3L), biggest_group_size = 3L, labels = structure(list(
    species = c("Adelophryne adiastola", "Adelophryne gutturosa", 
    "Adelphobates quinquevittatus", "Allobates fratisenescus", 
    "Allobates juanii", "Allobates kingsburyi")), row.names = c(NA, 
-6L), class = "data.frame", vars = "species", .Names = "species"), .Names = c("Group", 
"Scenario", "year", "random", "species", "Endemic"))

(My data has several million rows; I reproduce only a part of it here.)

Testsum <- mydata %>% 
  group_by(Group, Scenario, year, random) %>% 
  summarise(All = n(),
            Endemic = sum(Endemic, na.rm = T),
            noEndemic = sum(is.na(Endemic)))

# A tibble: 3 x 7
# Groups:   Group, Scenario, year [?]
       Group Scenario  year random   All Endemic noEndemic
      <fctr>   <fctr> <dbl>  <chr> <int>   <dbl>     <int>
1 Amphibians  Present  1940    obs     6       3         0
2 Amphibians  RCP 4.5  1940    obs     6       3         0
3 Amphibians  RCP 8.5  1940    obs     6       3         0

I expected noEndemic to be 3 in every row, since Endemic is NA for 3 of the 6 species.

I double-checked that:

Test3$Endemic %>% class
[1] "numeric"

Obviously, there is something very stupid that I am not seeing, even after several hours of messing around. Is it obvious to any of you? Thanks!

Upvotes: 2

Views: 2511

Answers (1)

akrun

Reputation: 886928

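The reason for this behavior is that Endemic is reassigned as a new summarised variable inside summarise(). The arguments are evaluated in order, so by the time sum(is.na(Endemic)) runs, Endemic is the single non-NA sum for the group rather than the original column, and the NA count is 0. Instead, we should give the summarised column a new name and rename it afterwards if needed: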

mydata %>%
     group_by(Group, Scenario, year, random) %>%
     summarise(All = n(),
               EndemicS = sum(Endemic, na.rm = TRUE),
               noEndemic = sum(is.na(Endemic))) %>%
     rename(Endemic = EndemicS) 
# A tibble: 3 x 7
# Groups:   Group, Scenario, year [3]
#       Group Scenario  year random   All Endemic noEndemic
#      <fctr>   <fctr> <dbl>  <chr> <int>   <dbl>     <int>
#1 Amphibians  Present  1940    obs     6       3         3
#2 Amphibians  RCP 4.5  1940    obs     6       3         3
#3 Amphibians  RCP 8.5  1940    obs     6       3         3
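
As a possible alternative to renaming, the NA count can be computed before the column is overwritten. The sketch below is not part of the original answer; it reuses the small test tibble from the question, and the names shadowed and reordered are only illustrative. It assumes dplyr is installed.

```r
library(dplyr)

# Test tibble from the question
df <- tibble(x = c(rep("a", 5), rep("b", 5)),
             y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

# y is overwritten first, so is.na() only sees the summed value per group
shadowed <- df %>%
  group_by(x) %>%
  summarise(y = sum(y, na.rm = TRUE),
            na = sum(is.na(y)))

# NAs are counted first, while y still refers to the original column
reordered <- df %>%
  group_by(x) %>%
  summarise(na = sum(is.na(y)),
            y = sum(y, na.rm = TRUE))
```

Here shadowed$na comes out as 0 for both groups, while reordered$na gives the NA counts from the original column. Reordering works, but it depends on argument order, so a distinct column name is arguably the safer choice.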

Upvotes: 4
