Reputation: 850

With min() in R return NA instead of Inf

Please consider the following:

I recently 'discovered' the awesome plyr and dplyr packages and use those for analysing patient data that is available to me in a data frame. Such a data frame could look like this:

df <- data.frame(id = c(1, 1, 1, 2, 2), # patient ID
                 diag = c(rep("dia1", 3), rep("dia2", 2)), # diagnosis
                 age = c(7.8, NA, 7.9, NA, NA)) # patient age

I would like to take the minimum age of each patient and then summarise those minima with a median and mean. I did the following:

library(dplyr)

min.age <- df %>% 
  group_by(id) %>% 
  summarise(min.age = min(age, na.rm = T))

Since there are NAs in the data frame I receive the warning:

`Warning message: In min(age, na.rm = T) :
no non-missing arguments to min; returning Inf`

With Inf I cannot call summary(df$min.age) in a meaningful way.

Using pmin() instead of min returned the error message:

Error in summarise_impl(.data, dots) :
 Column 'min.age' must be length 1 (a summary value), not 3

What can I do to avoid any Inf and instead get NA so that I can further proceed with: summary(df$min.age)?

Thanks a lot!

Upvotes: 11

Answers (9)

Maël

Reputation: 51974

Here is a helper that wraps min, and equally max or mean, so that an all-NA input returns NA instead of Inf and the warning is avoided:

safe <- function(x, f, ...) ifelse(all(is.na(x)), NA, f(x, na.rm = TRUE, ...))

For example:

library(dplyr) 
df <- data.frame(id = c(1, 1, 1, 2, 2), # patient ID
                 diag = c(rep("dia1", 3), rep("dia2", 2)), # diagnosis
                 age = c(7.8, NA, 7.9, NA, NA), # patient age
                 age2 = c(1, 2, 3, 4, 5)) # new column

df %>% 
  group_by(id) %>% 
  mutate(across(c(age, age2), list(min = ~ safe(.x, min),
                                   max = ~ safe(.x, max),
                                   mean = ~ safe(.x, mean))))

     id diag    age  age2 age_min age_max age_mean age2_min age2_max age2_mean
  <dbl> <chr> <dbl> <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
1     1 dia1    7.8     1     7.8     7.9     7.85        1        3       2  
2     1 dia1   NA       2     7.8     7.9     7.85        1        3       2  
3     1 dia1    7.9     3     7.8     7.9     7.85        1        3       2  
4     2 dia2   NA       4    NA      NA      NA           4        5       4.5
5     2 dia2   NA       5    NA      NA      NA           4        5       4.5
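
One detail worth noting about the helper above: ifelse(TRUE, NA, ...) returns a logical NA, so it relies on the caller coercing it to the column type. A minimal type-stable sketch for numeric columns could look like the following. The name safe_num is hypothetical and not part of the answer, and the sketch assumes numeric input only.

```r
# Hypothetical type-stable variant of safe() for numeric input.
safe_num <- function(x, f, ...) {
  # No non-missing values: return a numeric NA without calling f at all
  if (all(is.na(x))) return(NA_real_)
  f(x, na.rm = TRUE, ...)
}

all_missing  <- safe_num(c(NA, NA), min)
some_missing <- safe_num(c(7.8, NA, 7.9), min)

is.na(all_missing)      # TRUE
is.double(all_missing)  # TRUE
some_missing            # 7.8
```

Using if rather than ifelse also keeps the result at length one by construction, which is what summarise() expects.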

Upvotes: 0

Henrik

Reputation: 67778

Using collapse::fmin:

library(collapse)
fmin(NA, na.rm = TRUE)
# [1] NA

Note that na.rm defaults to TRUE in fmin, so fmin(NA) alone would suffice.

fmin(c(NA, 1, 2))
# [1] 1

Upvotes: 2

Bernhard

Reputation: 11

This one seems interesting as it avoids the warning:

myMin <- function(vec) {
  ifelse(length(vec[!is.na(vec)]) == 0, NA_real_, min(vec, na.rm = TRUE))
}
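
As a quick check without dplyr, the function can be applied per patient with base R's tapply(). The ages and ids vectors below are copied from the question's data frame. This is only a sketch of how the helper behaves, not a replacement for the grouped summarise.

```r
myMin <- function(vec) {
  ifelse(length(vec[!is.na(vec)]) == 0, NA_real_, min(vec, na.rm = TRUE))
}

ages <- c(7.8, NA, 7.9, NA, NA)  # patient age, as in the question
ids  <- c(1, 1, 1, 2, 2)         # patient ID, as in the question

# One minimum per patient: 7.8 for id 1 and NA for id 2.
# min() is never evaluated for the all-NA group, so no warning is raised.
res <- tapply(ages, ids, myMin)
res

# summary() now counts the missing value separately instead of seeing Inf
summary(as.vector(res))
```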

Upvotes: 1

davsjob

Reputation: 1950

An even simpler solution is the s function from the hablar package. It returns NA when a vector has no non-missing values, before the vector is evaluated by min/max. The code chunk by @awchisholm could be:

library(hablar)

min.age <- df %>% 
  group_by(id) %>% 
  summarise(min.age = min(s(age)))

Disclaimer: I am biased toward this solution since I authored the package.

Upvotes: 5

Tim Churches

Reputation: 31

The question has been answered, but it is useful to point out a pitfall when the column in question is a Date or a datetime: with na.rm = TRUE, the minimum of an all-missing group still prints as NA in the summary table, but it is actually Inf, so is.na() returns FALSE. This is doubly confusing! Consider:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(date = as.Date(c("2013-01-01", "2013-05-23", "", "2017-04-15", "", "")),
                 int = c(1L, 2L, NA, 4L, NA, NA),
                 group = rep(LETTERS[1:3],2))

s1 <- df %>% group_by(group) %>% summarise(min_date = min(date), min_int = min(int)) %>% mutate(min_date_missing = is.na(min_date), min_int_missing = is.na(min_int))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
s2 <- df %>% group_by(group) %>% summarise(min_date = min(date, na.rm = TRUE), min_int = min(int, na.rm = TRUE)) %>% mutate(min_date_missing = is.na(min_date), min_int_missing = is.na(min_int))

df
#>         date int group
#> 1 2013-01-01   1     A
#> 2 2013-05-23   2     B
#> 3       <NA>  NA     C
#> 4 2017-04-15   4     A
#> 5       <NA>  NA     B
#> 6       <NA>  NA     C
s1
#> # A tibble: 3 x 5
#>   group min_date   min_int min_date_missing min_int_missing
#>   <fct> <date>       <dbl> <lgl>            <lgl>          
#> 1 A     2013-01-01      1. FALSE            FALSE          
#> 2 B     NA             NA  TRUE             TRUE           
#> 3 C     NA             NA  TRUE             TRUE
s2
#> # A tibble: 3 x 5
#>   group min_date   min_int min_date_missing min_int_missing
#>   <fct> <date>       <dbl> <lgl>            <lgl>          
#> 1 A     2013-01-01      1. FALSE            FALSE          
#> 2 B     2013-05-23      2. FALSE            FALSE          
#> 3 C     NA            Inf  FALSE            FALSE

s1[[3,2]]
#> [1] NA
s2[[3,2]]
#> [1] NA

is.na(s1[[3,2]])
#> [1] TRUE
is.na(s2[[3,2]])
#> [1] FALSE

s1[[3,2]] == Inf
#> [1] NA
s2[[3,2]] == Inf
#> [1] TRUE

s1[[3,3]]
#> [1] NA
s2[[3,3]]
#> [1] Inf

is.na(s1[[3,3]])
#> [1] TRUE
is.na(s2[[3,3]])
#> [1] FALSE

s1[[3,3]] == Inf
#> [1] NA
s2[[3,3]] == Inf
#> [1] TRUE

sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.5
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] bindrcpp_0.2.2 dplyr_0.7.4   
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.17     utf8_1.1.3       crayon_1.3.4     digest_0.6.15   
#>  [5] rprojroot_1.3-2  assertthat_0.2.0 R6_2.2.2         backports_1.1.2 
#>  [9] magrittr_1.5     evaluate_0.10.1  pillar_1.2.1     cli_1.0.0       
#> [13] rlang_0.2.0.9001 stringi_1.1.7    rmarkdown_1.9    tools_3.4.3     
#> [17] stringr_1.3.0    glue_1.2.0       yaml_2.1.18      compiler_3.4.3  
#> [21] pkgconfig_2.0.1  htmltools_0.3.6  bindr_0.1.1      knitr_1.20      
#> [25] tibble_1.4.2

Created on 2018-06-27 by the reprex package (v0.2.0.9000).
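
One way to sidestep this for dates, sketched under the assumption that an all-missing group should yield a missing Date, is to check for the all-NA case explicitly with if/else rather than ifelse(), since ifelse() drops the Date class. The name safe_min_date is made up for this illustration.

```r
# Hypothetical helper: minimum of a Date vector, or a missing Date if all are NA
safe_min_date <- function(d) {
  if (all(is.na(d))) as.Date(NA) else min(d, na.rm = TRUE)
}

empty <- as.Date(c(NA, NA))

# Plain min() with na.rm = TRUE: the underlying number is Inf, not NA
raw_min <- suppressWarnings(min(empty, na.rm = TRUE))
is.infinite(unclass(raw_min))  # TRUE

# The helper returns a genuine missing Date instead
is.na(safe_min_date(empty))    # TRUE

# Vectors with at least one real date behave as usual
safe_min_date(as.Date(c("2013-01-01", NA, "2017-04-15")))
```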

Upvotes: 1

MKR

Reputation: 20085

I prefer to choose my own invalid value. Say 200 is treated as an invalid value for age.

Now one can twist the use of the min function slightly, e.g. min(age, 200, na.rm = TRUE). This ensures that the age is shown as 200 instead of +Inf when all values are missing. The result on df will be:

min.age <- df %>% 
  group_by(id) %>% 
  summarise(min.age = min(age, 200, na.rm = T))

> min.age
# A tibble: 2 x 2
#     id min.age
#  <dbl>   <dbl>
#1  1.00    7.80
#2  2.00  200

Now it's up to the programmer how to use or replace this invalid value.
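
Continuing that idea, the sentinel can be mapped back to NA before computing statistics. A minimal base R sketch, assuming 200 can never occur as a real age (the vector names here are illustrative only):

```r
sentinel <- 200  # assumed to be impossible as a real age

min_age <- c(
  min(c(7.8, NA, 7.9), sentinel, na.rm = TRUE),  # patient 1
  min(c(NA, NA), sentinel, na.rm = TRUE)         # patient 2, all ages missing
)
min_age  # 7.8 200.0

# Turn the sentinel back into NA so it does not distort later statistics
min_age[min_age == sentinel] <- NA
mean(min_age, na.rm = TRUE)  # 7.8
```

The caveat is that a genuine value equal to the sentinel would also be turned into NA, so the sentinel has to lie outside the plausible range of the data.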

Upvotes: 0

ngm

Reputation: 2589

Your code does the following:

1. Splits the data frame into groups by id.
2. Applies the min function within each group to the age variable, with the na.rm=TRUE option enabled.

So for id of 1 you get min(c(7.8, NA, 7.9), na.rm=TRUE), which is the same as min(c(7.8, 7.9)) which is just 7.8.

Then, for id of 2 you get min(c(NA, NA), na.rm=TRUE), which is the same as min(c()).

Now, what is the minimum of an empty set of numbers? A "minimum" is a value no larger than any value in the set, and it must satisfy the property that min(A) <= min(B) whenever B is a subset of A. The empty set is a subset of every set, so its minimum has to be at least as large as every possible minimum. One way to satisfy that is to say the minimum of the empty set is "infinity", and that's how R treats the situation.
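
The subset property and R's convention for empty input can be checked directly in base R. The vectors A and B are made-up values for illustration:

```r
A <- c(7.8, 7.9, 9.1)
B <- c(7.9, 9.1)   # B is a subset of A

min(A) <= min(B)   # TRUE

# Empty input: R warns and returns Inf for min, -Inf for max
empty_min <- suppressWarnings(min(numeric(0)))
empty_max <- suppressWarnings(max(numeric(0)))
empty_min          # Inf
empty_max          # -Inf

# The property still holds with the empty set as the subset
min(A) <= empty_min  # TRUE
```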

You can't really avoid getting Inf from min() in this situation. But you could add a mutate to your chain to change any Inf to whatever you like, such as NA.

df %>% group_by(id) %>% summarize(min_age = min(age, na.rm = TRUE)) %>% 
    mutate(min_age = ifelse(is.infinite(min_age), NA, min_age))

Upvotes: 8

Onyambu

Reputation: 79208

(min.age <- df %>% 
    group_by(id) %>% 
    summarise(min.age = ifelse(all(is.na(age)),NA,min(age, na.rm = T))))
# A tibble: 2 x 2
     id min.age
  <dbl>   <dbl>
1     1     7.8
2     2      NA

Upvotes: 2

Andrew Chisholm

Reputation: 6567

You could use is.infinite() to detect the infinities and ifelse to conditionally set them to NA.

#using your df and the dplyr package
min.age <- 
  df %>% 
  group_by(id) %>% 
  summarise(min.age = min(age, na.rm = T)) %>%
  mutate(min.age = ifelse(is.infinite(min.age), NA, min.age))
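
Once the infinities are replaced, summary() reports the missing group separately instead of being distorted by Inf. A small base R sketch, with a hypothetical vector standing in for the min.age column:

```r
# Stand-in for the summarised column: patient 1 has 7.8, patient 2 got Inf
min_age <- c(7.8, Inf)

# Same replacement as above, written with logical indexing
min_age[is.infinite(min_age)] <- NA

summary(min_age)               # quantiles of the finite values plus an NA count
median(min_age, na.rm = TRUE)  # 7.8
```

Note that is.infinite() also matches -Inf, so this replacement would remove any genuinely infinite values as well.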

Upvotes: 16
