spore234

Reputation: 3640

Catch NAs when fitting a linear model with dplyr

Here's an example data frame:

library(dplyr)
df <- data.frame(id=c(1,1,1,2,2,2),
   v2=factor(c("a","c","c","a","b","d")),
   v3=c(1,NA,NA,6,7,9),
   v4=c(5:10))

Note that v3 contains NAs, so when I try to fit a linear model for each id, I get an error:

slope <- df %>% filter(v2=="c") %>% 
  group_by(id) %>% 
  do(fit = lm(v3 ~ v4, .)) %>%
  summarise(slope = coef(fit)[2])

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...)   : 
  0 (non-NA) cases

How can I catch this error and replace it with a default value when only NAs exist?

Note that v4 could also contain NAs, and if v3=c(1,NA) and v4=c(NA,2), a linear model could not be fitted either.

For example, if df does not contain any "c" then I can do this easily with

if(nrow(slope) == 0) slope <- 0

because then slope is an empty data frame.

Upvotes: 1

Views: 369

Answers (2)

jowalski

Reputation: 131

If you are literally asking "how can I catch this error", you could try tryCatch.

Depending on the situation, this may be more useful: it handles only errors with the "0 (non-NA) cases" message and re-raises anything else, and you don't have to do messy data checking.

You can also use failwith in the plyr package, which is simpler to use, though I believe it catches all errors.

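all_na_msg <- "0 (non-NA) cases"
trymodel <- function(df, default = NA) {
  tryCatch(lm(v3 ~ v4, df),
           # the handler must be a function that takes the condition object
           error = function(e) {
             if (conditionMessage(e) == all_na_msg)
               default
             else
               stop(e)  # re-raise any other error
           })
}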

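slope <- df %>% filter(v2=="c") %>% 
  group_by(id) %>% 
  # pass the current group (.), not the whole df
  do(fit = trymodel(.)) %>%
  # coef() fails on the NA default, so only call it on real fits
  summarise(slope = if (inherits(fit, "lm")) coef(fit)[2] else NA_real_)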
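
For comparison, a catch-all variant in the spirit of failwith can be sketched in base R alone. This is only an illustration: safe_slope and its default argument are hypothetical names, not part of plyr or dplyr, and the handler swallows every error, not just the "0 (non-NA) cases" one.

```r
# Example data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 v2 = factor(c("a", "c", "c", "a", "b", "d")),
                 v3 = c(1, NA, NA, 6, 7, 9),
                 v4 = 5:10)

# Hypothetical helper: return the v4 slope, or `default` if lm() raises any error
safe_slope <- function(d, default = NA_real_) {
  tryCatch(unname(coef(lm(v3 ~ v4, data = d))[2]),
           error = function(e) default)
}

# Keep only the "c" rows, then fit one model per id
df_c <- df[df$v2 == "c", ]
slopes <- sapply(split(df_c, df_c$id), safe_slope, default = 0)
slopes
```

After filtering, the only group left is id 1, whose v3 values are all NA, so lm fails and that group gets the default of 0.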

Upvotes: 1

akrun

Reputation: 887078

We could use an if/else condition within do to check for NA elements. If all the elements of either 'v3' or (|) 'v4' are NA, it returns the slope as NA; otherwise it fits the lm and extracts the slope value.

df %>% 
  filter(v2=='c') %>%
  group_by(id) %>%
  do({
    if (all(is.na(.$v3)) | all(is.na(.$v4)))
      data.frame(slope=NA)
    else
      data.frame(slope=coef(lm(v3~v4, .))[2])
  }) %>%
  slice(1L) %>% 
  ungroup() %>%
  select(-id)

data

df <- data.frame(id=c(1,1,1,2,2,2, 3, 3, 3,3, 3, 4, 4),
 v2=factor(c("a","c","c","a","b","d", "c", "c", "a", "c", "c", "c", "c")),
 v3=c(1,NA,NA,6,7,9, NA, 1, NA, 5,8, NA, 5 ),
 v4=c(5:17))
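
The question also mentions groups where v3 and v4 are never observed together, e.g. v3=c(1,NA) and v4=c(NA,2), which an "all NA" test on each column would not flag. A minimal base R sketch of a stricter check counts complete (v3, v4) pairs before fitting. The helper name slope_or_default and the threshold of two complete pairs are assumptions made for this illustration.

```r
# Extended example data from this answer
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4),
                 v2 = factor(c("a", "c", "c", "a", "b", "d", "c",
                               "c", "a", "c", "c", "c", "c")),
                 v3 = c(1, NA, NA, 6, 7, 9, NA, 1, NA, 5, 8, NA, 5),
                 v4 = 5:17)

# Hypothetical helper: fit only when at least two rows have both
# v3 and v4 observed; otherwise return `default` without calling lm()
slope_or_default <- function(d, default = NA_real_) {
  if (sum(complete.cases(d$v3, d$v4)) < 2) return(default)
  unname(coef(lm(v3 ~ v4, data = d))[2])
}

# Keep only the "c" rows, then compute one slope per id
df_c <- df[df$v2 == "c", ]
slopes <- sapply(split(df_c, df_c$id), slope_or_default)
slopes
```

With this data, ids 1 and 4 have fewer than two complete pairs and get the default, while id 3 is fitted on its three complete rows.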

Upvotes: 3
