I have a data frame that includes a corrupt row with NAs and "". I cannot remove it from the .csv file I am importing into R, since Excel cannot open a .csv document of that size.
I do a check like the one below when I first read.csv(), to remove the row with NA:
if (any(is.na(unique(data$A)))) {
    print("WARNING: data has a corrupt row in it!")
    data <- data[!is.na(data$A), ]
}
However, as if it were a factor, the A column remembers NA as a level:
> summary(data$A)
Mode FALSE TRUE NA's
logical 185692 36978 0
This obviously causes issues when I am trying to fit a linear model. How can I get rid of the NA as a logical level here?
I tried this, but it doesn't seem to work:
A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
Mode FALSE TRUE NA's
logical 185692 36978 0
unique(A)
[1] FALSE TRUE
Upvotes: 0
Views: 1527
Reputation: 99351
As mentioned in my other answer, those actually are not factor levels. Since you asked how to remove the NA printing on summary, I'm undeleting this answer.
The NA printing is hard-coded into the summary for a logical vector. Here's the relevant code from summary.default.
# value <- if (is.logical(object))
# c(Mode = "logical", {
# tb <- table(object, exclude = NULL)
# if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
# dimnames(tb)[[1L]][iN] <- "NA's"
# tb
# })
The exclude = NULL in table is the problem. If we look at the exclude argument in table with a logical vector log, we can see that when it is NULL the NAs always print out.
log <- c(NA, logical(4), NA, !logical(2), NA)
table(log, exclude = NULL) ## with NA values
# log
# FALSE TRUE <NA>
# 4 2 3
table(log[!is.na(log)], exclude = NULL) ## NA values removed
#
# FALSE TRUE <NA>
# 4 2 0
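As a side note, table also has a useNA argument that controls the NA column directly. The sketch below uses a made-up vector called flags (not from the question) to show the three settings. The printed layout may vary with your R version, so only the structure is described in the comments.

```r
# Hypothetical logical vector with two missing values
flags <- c(NA, FALSE, FALSE, TRUE, NA)

# "no": never show an NA column
tb_no <- table(flags, useNA = "no")

# "ifany": show the NA column only when NAs are present
tb_ifany <- table(flags, useNA = "ifany")

# "always": show the NA column even when the count is zero
tb_always <- table(flags[!is.na(flags)], useNA = "always")

print(tb_no)
print(tb_ifany)
print(tb_always)
```

A zero in the NA column therefore only means that table was asked to report the NA count, not that the vector still contains any NA.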
To make your summary print the way you want it, we can write a summary method based on the original source code.
summary.logvec <- function(object, exclude = NA, ...) {
    stopifnot(is.logical(object))
    value <- c(Mode = "logical", {
        ## exclude = NA drops the NA count; exclude = NULL keeps it
        tb <- table(object, exclude = exclude)
        if (is.null(exclude)) {
            if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
                dimnames(tb)[[1L]][iN] <- "NA's"
        }
        tb
    })
    ## the "summaryDefault" class makes the result print via print.summaryDefault
    class(value) <- c("summaryDefault", "table")
    value
}
And here are the results. Since exclude defaults to NA in our summary method, the NAs will not print unless we set it to NULL:
summary(log) ## original vector
# Mode FALSE TRUE NA's
# logical 4 2 3
class(log) <- "logvec"
summary(log, exclude = NULL) ## prints NA when exclude = NULL
# Mode FALSE TRUE NA's
# logical 4 2 3
summary(log) ## NA's don't print
# Mode FALSE TRUE
# logical 4 2
Now that I've done all this, I'm wondering whether you have actually tried to run your linear model.
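To check that quickly, here is a minimal sketch with invented data (the data frame dat and its values are assumptions, not the asker's data). It relies on lm's default na.action of na.omit, which drops incomplete rows, and on a logical predictor being coded as a single TRUE-vs-FALSE contrast.

```r
# Hypothetical data: logical predictor A with one missing value
dat <- data.frame(
  A = c(TRUE, FALSE, NA, TRUE, FALSE, TRUE),
  y = c(2.1, 0.9, 1.5, 1.8, 1.2, 2.3)
)

# Rows with NA in A are dropped under the default na.action (na.omit)
fit <- lm(y ~ A, data = dat)

# Only an intercept and one coefficient for A being TRUE are estimated
print(names(coef(fit)))

# Number of observations actually used in the fit
print(nobs(fit))
```

If the model fits this way on the real data, the NA's entry in summary is only a display matter and does not add a level to the model.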
Upvotes: 0
Reputation: 99351
First, your data$A is not a factor, it's a logical. The summary methods are not the same for factors and logicals: logicals use summary.default, while factors dispatch to summary.factor. Plus, the result itself tells you that the variable is a logical.
fac <- factor(c(NA, letters[1:4]))
log <- c(NA, logical(4), !logical(2))
summary(fac)
# a b c d NA's
# 1 1 1 1 1
summary(log)
# Mode FALSE TRUE NA's
# logical 4 2 1
See ?summary for the differences.
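One way to confirm that a logical vector carries no levels at all is sketched below. The vector flags and the derived objects are made up for illustration.

```r
# Hypothetical logical vector with a missing value
flags <- c(NA, FALSE, TRUE, FALSE)

print(is.factor(flags))   # a logical vector is not a factor
print(levels(flags))      # so it has no levels attribute

# Converting to a factor drops NA from the levels by default
as_factor <- factor(flags)
print(levels(as_factor))

# NA only becomes a level when requested explicitly, e.g. with addNA()
with_na <- addNA(as_factor)
print(levels(with_na))
```

So there is nothing for droplevels to remove from a logical column, and a factor built from it does not get an NA level unless you ask for one.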
Second, your call
A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
is also calling summary.default, because wrapping droplevels in as.logical (why?) converts the factor straight back to a logical vector. So don't change data_combine$A at all, and just try
summary(data_combine$A)
and see how that goes. For more information, please provide a sample of your data.
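In the meantime, going back to the import step described in the question, the corrupt row can be handled without opening the file in Excel. This is a rough sketch on an inline stand-in for the file: the column names A and B and the contents of csv_text are assumptions. The na.strings argument of read.csv makes empty fields read as NA, so a single is.na filter catches both kinds of bad value.

```r
# Hypothetical stand-in for the real .csv file
csv_text <- "A,B
TRUE,1
,
FALSE,2
NA,"

# Treat both the literal NA and empty fields as missing
imported <- read.csv(text = csv_text, na.strings = c("NA", ""))

# Keep only rows where A is present
clean <- imported[!is.na(imported$A), , drop = FALSE]

str(clean)
```

For a real file, replace the text argument with the file path; the rest stays the same.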
Upvotes: 1