Reputation: 850
Please consider the following:
I recently 'discovered' the awesome plyr and dplyr packages and use them for analysing patient data that is available to me in a data frame. Such a data frame could look like this:
df <- data.frame(id = c(1, 1, 1, 2, 2), # patient ID
diag = c(rep("dia1", 3), rep("dia2", 2)), # diagnosis
age = c(7.8, NA, 7.9, NA, NA)) # patient age
I would like to summarise the minimum patient age of all patients with a median and mean. I did the following:
min.age <- df %>%
group_by(id) %>%
summarise(min.age = min(age, na.rm = T))
Since there are NAs in the data frame, I receive the warning:
`Warning message: In min(age, na.rm = T) :
no non-missing arguments to min; returning Inf`
With Inf I cannot call summary(df$min.age) in a meaningful way.
Using pmin() instead of min returned the error message:
Error in summarise_impl(.data, dots) :
Column 'min.age' must be length 1 (a summary value), not 3
What can I do to avoid any Inf and instead get NA so that I can further proceed with summary(df$min.age)?
Thanks a lot!
Upvotes: 11
Views: 11863
Reputation: 51974
Here is a function that can be used with min, but also with max or mean, that avoids this problem and makes the approach more generalizable:
safe <- function(x, f, ...) ifelse(all(is.na(x)), NA, f(x, na.rm = TRUE, ...))
For example:
library(dplyr)
df <- data.frame(id = c(1, 1, 1, 2, 2), # patient ID
diag = c(rep("dia1", 3), rep("dia2", 2)), # diagnosis
age = c(7.8, NA, 7.9, NA, NA), # patient age
age2 = c(1, 2, 3, 4, 5)) # new column
df %>%
group_by(id) %>%
mutate(across(c(age, age2), list(min = ~ safe(.x, min),
max = ~ safe(.x, max),
mean = ~ safe(.x, mean))))
id diag age age2 age_min age_max age_mean age2_min age2_max age2_mean
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 dia1 7.8 1 7.8 7.9 7.85 1 3 2
2 1 dia1 NA 2 7.8 7.9 7.85 1 3 2
3 1 dia1 7.9 3 7.8 7.9 7.85 1 3 2
4 2 dia2 NA 4 NA NA NA 4 5 4.5
5 2 dia2 NA 5 NA NA NA 4 5 4.5
Upvotes: 0
Reputation: 67778
Using collapse::fmin:
fmin(NA, na.rm = TRUE)
# [1] NA
Note that na.rm defaults to TRUE, so fmin on its own would suffice.
fmin(c(NA, 1, 2))
# [1] 1
Upvotes: 2
Reputation: 11
This one seems interesting as it avoids the warning:
myMin <- function(vec) {
ifelse(length(vec[!is.na(vec)]) == 0, NA_real_, min(vec, na.rm = TRUE))
}
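A quick check of how this behaves, as a minimal sketch in base R. The two example vectors are hypothetical and simply mirror the age values per patient from the question. Because ifelse only evaluates the branch it needs for a single TRUE test, min is never called on the empty vector, which is why no warning appears.

```r
# myMin exactly as defined in this answer
myMin <- function(vec) {
  ifelse(length(vec[!is.na(vec)]) == 0, NA_real_, min(vec, na.rm = TRUE))
}

# Hypothetical inputs: the ages of patient 1 and patient 2 from the question
ages_id1 <- c(7.8, NA, 7.9)
ages_id2 <- c(NA_real_, NA_real_)

res1 <- myMin(ages_id1)  # minimum of the non-missing values
res2 <- myMin(ages_id2)  # NA_real_, min() is not evaluated here

# Record whether calling myMin on an all-NA vector raises a warning
warned <- tryCatch({ myMin(ages_id2); FALSE }, warning = function(w) TRUE)
```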
Upvotes: 1
Reputation: 1950
An even simpler solution is the s function from the hablar package. It replaces an empty vector with NA before the vector is evaluated in min/max. The code chunk by @awchisholm could be:
library(hablar)
min.age <- df %>%
group_by(id) %>%
summarise(min.age = min(s(age)))
Disclaimer: I am biased towards this solution since I authored the package.
Upvotes: 5
Reputation: 31
The question has been answered, but it is useful to point out that if the column in question is a Date or a datetime, the result for an all-missing group will still appear to be NA in the summary table, but it actually isn't NA. This is doubly confusing! Consider:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(date = as.Date(c("2013-01-01", "2013-05-23", "", "2017-04-15", "", "")),
int = c(1L, 2L, NA, 4L, NA, NA),
group = rep(LETTERS[1:3],2))
s1 <- df %>% group_by(group) %>% summarise(min_date = min(date), min_int = min(int)) %>% mutate(min_date_missing = is.na(min_date), min_int_missing = is.na(min_int))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
s2 <- df %>% group_by(group) %>% summarise(min_date = min(date, na.rm = TRUE), min_int = min(int, na.rm = TRUE)) %>% mutate(min_date_missing = is.na(min_date), min_int_missing = is.na(min_int))
df
#> date int group
#> 1 2013-01-01 1 A
#> 2 2013-05-23 2 B
#> 3 <NA> NA C
#> 4 2017-04-15 4 A
#> 5 <NA> NA B
#> 6 <NA> NA C
s1
#> # A tibble: 3 x 5
#> group min_date min_int min_date_missing min_int_missing
#> <fct> <date> <dbl> <lgl> <lgl>
#> 1 A 2013-01-01 1. FALSE FALSE
#> 2 B NA NA TRUE TRUE
#> 3 C NA NA TRUE TRUE
s2
#> # A tibble: 3 x 5
#> group min_date min_int min_date_missing min_int_missing
#> <fct> <date> <dbl> <lgl> <lgl>
#> 1 A 2013-01-01 1. FALSE FALSE
#> 2 B 2013-05-23 2. FALSE FALSE
#> 3 C NA Inf FALSE FALSE
s1[[3,2]]
#> [1] NA
s2[[3,2]]
#> [1] NA
is.na(s1[[3,2]])
#> [1] TRUE
is.na(s2[[3,2]])
#> [1] FALSE
s1[[3,2]] == Inf
#> [1] NA
s2[[3,2]] == Inf
#> [1] TRUE
s1[[3,3]]
#> [1] NA
s2[[3,3]]
#> [1] Inf
is.na(s1[[3,3]])
#> [1] TRUE
is.na(s2[[3,3]])
#> [1] FALSE
s1[[3,3]] == Inf
#> [1] NA
s2[[3,3]] == Inf
#> [1] TRUE
sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bindrcpp_0.2.2 dplyr_0.7.4
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.17 utf8_1.1.3 crayon_1.3.4 digest_0.6.15
#> [5] rprojroot_1.3-2 assertthat_0.2.0 R6_2.2.2 backports_1.1.2
#> [9] magrittr_1.5 evaluate_0.10.1 pillar_1.2.1 cli_1.0.0
#> [13] rlang_0.2.0.9001 stringi_1.1.7 rmarkdown_1.9 tools_3.4.3
#> [17] stringr_1.3.0 glue_1.2.0 yaml_2.1.18 compiler_3.4.3
#> [21] pkgconfig_2.0.1 htmltools_0.3.6 bindr_0.1.1 knitr_1.20
#> [25] tibble_1.4.2
Created on 2018-06-27 by the reprex package (v0.2.0.9000).
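One possible way around this for Date columns, as a rough sketch rather than a definitive fix: the helper name safe_min below is hypothetical and not part of any package. It uses if/else instead of ifelse, because ifelse drops attributes such as the Date class, and it returns a missing value of the same class as its input when every value is NA.

```r
# Hypothetical helper (an assumption, not from the answer above):
# return a typed NA when all values are missing, otherwise the minimum
# of the non-missing values. Indexing with NA_integer_ yields a single
# NA that keeps the class of x (e.g. Date).
safe_min <- function(x) {
  if (all(is.na(x))) x[NA_integer_] else min(x, na.rm = TRUE)
}

all_missing <- as.Date(c(NA, NA))
res_date <- safe_min(all_missing)

date_is_na   <- is.na(res_date)             # a real NA this time
date_is_date <- inherits(res_date, "Date")  # class is preserved

res_int <- safe_min(c(4L, NA, 1L))          # ordinary case still works
```

Inside the pipeline above this would replace min(date, na.rm = TRUE) with safe_min(date).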
Upvotes: 1
Reputation: 20085
I prefer to choose my own invalid value. Say 200 is the invalid value for age.
Now one can twist the use of the min function slightly, e.g. min(age, 200, na.rm = TRUE). This ensures that age is shown as 200 instead of +Inf when all values are missing. The result on df will be:
min.age <- df %>%
group_by(id) %>%
summarise(min.age = min(age, 200, na.rm = T))
> min.age
# A tibble: 2 x 2
# id min.age
# <dbl> <dbl>
#1 1.00 7.80
#2 2.00 200
Now it's up to the programmer how to use or replace this invalid value.
Upvotes: 0
Reputation: 2589
Your code does the following: it groups the data by id, then applies the min function within each group to the age variable, with the na.rm=TRUE option enabled.
So for id of 1 you get min(c(7.8, NA, 7.9), na.rm=TRUE), which is the same as min(c(7.8, 7.9)), which is just 7.8.
Then, for id of 2 you get min(c(NA, NA), na.rm=TRUE), which is the same as min(c()).
Now, what is the minimum of an empty set of numbers? The definition of "minimum" is "a value no larger than any value in the set", and it must satisfy the property that min(A) <= min(B) whenever B is a subset of A. One way to define the minimum of the empty set is to say it is "infinity", and that's how R treats the situation.
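To make that concrete, here is a small base R sketch. The variable names are made up for illustration, and suppressWarnings is only there to silence the same warning the question shows.

```r
empty <- numeric(0)

# On an empty vector, min() and max() return Inf and -Inf (with a warning)
min_empty <- suppressWarnings(min(empty))
max_empty <- suppressWarnings(max(empty))

# With Inf as the minimum of the empty set, the subset property
# min(A) <= min(B) also holds when B is the empty subset of A
A <- c(7.8, 7.9)
subset_ok <- min(A) <= min_empty

# Combining a vector with an empty one leaves its minimum unchanged
combine_ok <- identical(min(c(A, empty)), min(min(A), min_empty))
```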
You can't really avoid getting Inf in this situation. But you could add another mutate to your chain to change any Inf to whatever you like, such as NA.
df %>% group_by(id) %>% summarize(min_age = min(age, na.rm = TRUE)) %>%
mutate(min_age = ifelse(is.infinite(min_age), NA, min_age))
Upvotes: 8
Reputation: 79208
(min.age <- df %>%
group_by(id) %>%
summarise(min.age = ifelse(all(is.na(age)),NA,min(age, na.rm = T))))
# A tibble: 2 x 2
id min.age
<dbl> <dbl>
1 1 7.8
2 2 NA
Upvotes: 2
Reputation: 6567
You could use is.infinite() to detect the infinities and ifelse to conditionally set them to NA.
#using your df and the dplyr package
min.age <-
df %>%
group_by(id) %>%
summarise(min.age = min(age, na.rm = T)) %>%
mutate(min.age = ifelse(is.infinite(min.age), NA, min.age))
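To see why the replacement matters for the summary() call the question wants to make, here is a small base R sketch. The two vectors are hypothetical stand-ins for the min.age column before and after the mutate step.

```r
# Hypothetical per-patient minima before and after replacing Inf with NA
with_inf <- c(7.8, Inf)
with_na  <- c(7.8, NA)

s_inf <- summary(with_inf)  # mean and median are dragged to Inf
s_na  <- summary(with_na)   # computed on non-missing values, NAs counted separately

mean_inf <- s_inf[["Mean"]]
mean_na  <- s_na[["Mean"]]
n_missing <- s_na[["NA's"]]
```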
Upvotes: 16