Zhubarb

Reputation: 11905

R dropping NA's in logical column levels

I have a data frame that includes a corrupt row containing NAs and "". I cannot remove this row from the .csv file I am importing into R, since Excel cannot open a .csv file of this size.

I do a check when I first read.csv() like below to remove the row with NA:

if (any(is.na(data$A))) {
  print("WARNING: data has a corrupt row in it!")
  data <- data[!is.na(data$A), ]
}

However, as if it were a factor, the A column remembers NA as a level:

> summary(data$A)
   Mode   FALSE    TRUE    NA's 
logical  185692   36978       0 

This obviously causes issues when I am trying to fit a linear model. How can I get rid of the NA as a logical level here?

I tried this, but it doesn't seem to work:

A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
   Mode   FALSE    TRUE    NA's 
logical  185692   36978       0 
unique(A)
[1] FALSE  TRUE

Upvotes: 0

Views: 1527

Answers (2)

Rich Scriven

Reputation: 99351

As mentioned in my other answer, those actually are not factor levels. Since you asked how to remove the NA printing on summary, I'm undeleting this answer.

The NA printing is hard-coded into a summary for a logical vector. Here's the relevant code from summary.default.

# value <- if (is.logical(object)) 
#     c(Mode = "logical", {
#         tb <- table(object, exclude = NULL)
#         if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n))) 
#             dimnames(tb)[[1L]][iN] <- "NA's"
#         tb
#     })

The exclude = NULL in table is the problem. If we look at the exclude argument in table with a logical vector log, we can see that when it is NULL the NAs always print out.

log <- c(NA, logical(4), NA, !logical(2), NA)
table(log, exclude = NULL)                  ## with NA values
# log
# FALSE  TRUE  <NA> 
#     4     2     3 
table(log[!is.na(log)], exclude = NULL)     ## NA values removed
# 
# FALSE  TRUE  <NA> 
#     4     2     0 

To make your summary print the way you want it, we can write a summary method based on the original source code.

summary.logvec <- function(object, exclude = NA, ...) {
    stopifnot(is.logical(object))
    value <- c(Mode = "logical", {
        ## count the values; exclude = NA drops the NA entry entirely
        tb <- table(object, exclude = exclude)
        ## relabel the NA entry as "NA's" only when NAs are being kept
        if (is.null(exclude)) {
            if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
                dimnames(tb)[[1L]][iN] <- "NA's"
        }
        tb
    })
    ## same classes as a summary.default result, so it prints the same way
    class(value) <- c("summaryDefault", "table")
    value
}

And then here are the results. Since exclude defaults to NA in our summary method, the NAs will not print unless we set it to NULL.

summary(log)  ## original vector
#    Mode   FALSE    TRUE    NA's 
# logical       4       2       3 
class(log) <- "logvec"
summary(log, exclude = NULL)  ## prints NA when exclude = NULL
#    Mode   FALSE    TRUE    NA's 
# logical       4       2       3 
summary(log)  ## NA's don't print 
#    Mode   FALSE    TRUE 
# logical       4       2 

Now that I've done all this, I'm wondering whether you have actually tried to run your linear model.
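As a rough illustration of why the summary printout should not affect the model, here is a minimal sketch with made-up data. The data frame d and its columns y and A are assumptions for the example, not your data.

```r
set.seed(1)

# Hypothetical data: numeric response y, logical predictor A with one NA
d <- data.frame(
  y = rnorm(10),
  A = c(TRUE, FALSE, TRUE, NA, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# na.omit (spelled out here) drops the incomplete row before fitting
fit <- lm(y ~ A, data = d, na.action = na.omit)

print(names(coef(fit)))  # intercept plus one term for A being TRUE
print(nobs(fit))         # number of rows actually used in the fit
```

The logical column contributes a single coefficient, and the row with the NA is simply left out of the fit.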

Upvotes: 0

Rich Scriven

Reputation: 99351

First, your data$A is not a factor, it's a logical. The summary print methods are not the same for factors and logicals. Logicals use summary.default while factors dispatch to summary.factor. Plus it tells you in the result that the variable is a logical.

fac <- factor(c(NA, letters[1:4]))
log <- c(NA, logical(4), !logical(2))
summary(fac)
#   a    b    c    d NA's 
#   1    1    1    1    1 
summary(log)
#    Mode   FALSE    TRUE    NA's 
# logical       4       2       1 

See ?summary for the differences.
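A quick way to confirm that a logical vector carries no levels at all is to ask for them directly. A minimal sketch, where lgl is a hypothetical name chosen for the example:

```r
fac <- factor(c(NA, "a", "b"))
lgl <- c(NA, FALSE, TRUE)

print(is.factor(fac))  # TRUE
print(levels(fac))     # "a" "b"; NA is not a level by default
print(is.factor(lgl))  # FALSE
print(levels(lgl))     # NULL; a logical vector has no levels
```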

Second, your call

A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)

is also calling summary.default because you wrapped droplevels with as.logical (why?). So don't change data_combine$A at all, and just try

summary(data_combine$A)

and see how that goes. For more information, please provide a sample of your data.
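To check whether the corrupt row is really gone, it may be simpler to count the NA values directly rather than read them off the summary. A minimal sketch on a made-up data frame (the columns A and B and their contents are assumptions):

```r
# Hypothetical stand-in for the imported data, with one corrupt row
data <- data.frame(
  A = c(TRUE, FALSE, NA, TRUE),
  B = c("x", "y", "", "z")
)

if (anyNA(data$A)) {
  message("data has a corrupt row in it; dropping it")
  data <- data[!is.na(data$A), ]
}

print(sum(is.na(data$A)))  # NA values left in column A
print(nrow(data))          # rows remaining after the filter
```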

Upvotes: 1
