Reputation: 797
I can't find what I am doing wrong when summarising a column that contains both values and NAs. I have read in many places that you can count cases in summarise with sum(), and that sum(is.na(variable)) can be used to count NA cases.
Actually, I can reproduce that behaviour with a test tibble:
library(dplyr)
df <- tibble(x = c(rep("a", 5), rep("b", 5)), y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))
df %>%
  group_by(x) %>%
  summarise(one = sum(y, na.rm = TRUE),
            na = sum(is.na(y)))
And this is the expected result:
# A tibble: 2 x 3
  x       one    na
  <chr> <dbl> <int>
1 a         2     3
2 b         3     2
For some reason, I cannot reproduce the result with my data:
mydata <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Amphibians",
"Birds", "Mammals", "Reptiles", "Plants"), class = "factor"),
Scenario = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Present",
"RCP 4.5", "RCP 8.5"), class = "factor"), year = c(1940,
1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940,
1940, 1940, 1940, 1940, 1940, 1940, 1940), random = c("obs",
"obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs",
"obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs"),
species = c("Allobates fratisenescus", "Allobates fratisenescus",
"Allobates fratisenescus", "Allobates juanii", "Allobates juanii",
"Allobates juanii", "Allobates kingsburyi", "Allobates kingsburyi",
"Allobates kingsburyi", "Adelophryne adiastola", "Adelophryne adiastola",
"Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa",
"Adelophryne gutturosa", "Adelphobates quinquevittatus",
"Adelphobates quinquevittatus", "Adelphobates quinquevittatus"
), Endemic = c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, -18L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "species", indices = list(
9:11, 12:14, 15:17, 0:2, 3:5, 6:8), group_sizes = c(3L, 3L,
3L, 3L, 3L, 3L), biggest_group_size = 3L, labels = structure(list(
species = c("Adelophryne adiastola", "Adelophryne gutturosa",
"Adelphobates quinquevittatus", "Allobates fratisenescus",
"Allobates juanii", "Allobates kingsburyi")), row.names = c(NA,
-6L), class = "data.frame", vars = "species", .Names = "species"), .Names = c("Group",
"Scenario", "year", "random", "species", "Endemic"))
(My data has several million rows; I reproduce only a part of it here.)
Testsum <- mydata %>%
  group_by(Group, Scenario, year, random) %>%
  summarise(All = n(),
            Endemic = sum(Endemic, na.rm = T),
            noEndemic = sum(is.na(Endemic)))
# A tibble: 3 x 7
# Groups:   Group, Scenario, year [?]
  Group      Scenario  year random   All Endemic noEndemic
  <fctr>     <fctr>   <dbl> <chr>  <int>   <dbl>     <int>
1 Amphibians Present   1940 obs        6       3         0
2 Amphibians RCP 4.5   1940 obs        6       3         0
3 Amphibians RCP 8.5   1940 obs        6       3         0
I expected noEndemic to be 3 in every row, since Endemic is NA for 3 of the species.
I double-checked the class of the column:
Test3$Endemic %>% class
[1] "numeric"
There must be something simple that I am not seeing, even after several hours of trying. Is it obvious to any of you? Thanks!
Upvotes: 2
Views: 2511
Reputation: 886928
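The reason for this behavior is that Endemic was assigned as a new summarised variable. summarise() evaluates its arguments in order, so by the time sum(is.na(Endemic)) runs, Endemic refers to the per-group sum (a single non-NA number) rather than the original column, and the NA count is 0. Instead, give the sum a new column name and rename it afterwards: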
mydata %>%
  group_by(Group, Scenario, year, random) %>%
  summarise(All = n(),
            EndemicS = sum(Endemic, na.rm = TRUE),
            noEndemic = sum(is.na(Endemic))) %>%
  rename(Endemic = EndemicS)
# A tibble: 3 x 7
# Groups:   Group, Scenario, year [3]
#  Group      Scenario  year random   All Endemic noEndemic
#  <fctr>     <fctr>   <dbl> <chr>  <int>   <dbl>     <int>
#1 Amphibians Present   1940 obs        6       3         3
#2 Amphibians RCP 4.5   1940 obs        6       3         3
#3 Amphibians RCP 8.5   1940 obs        6       3         3
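The same effect can be reproduced on the toy tibble from the question. This is only a minimal sketch, assuming dplyr is attached; the object names shadowed and reordered are illustrative and not part of the original post. It also shows a possible alternative to renaming: count the NAs before the column name is reused.

```r
library(dplyr)

# Toy data from the question: 3 NAs in group "a", 2 NAs in group "b".
df <- tibble(x = c(rep("a", 5), rep("b", 5)),
             y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

# Reusing the name `y` first: the later is.na(y) sees the per-group sum,
# which is never NA, so the count is 0 in every group.
shadowed <- df %>%
  group_by(x) %>%
  summarise(y = sum(y, na.rm = TRUE),
            na = sum(is.na(y)))

# Counting the NAs before `y` is overwritten keeps the original column in scope.
reordered <- df %>%
  group_by(x) %>%
  summarise(na = sum(is.na(y)),
            y = sum(y, na.rm = TRUE))

shadowed$na   # 0 0
reordered$na  # 3 2
```

Reordering changes the column order of the result, so the rename approach above may still be preferable when the output layout matters.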
Upvotes: 4