Reputation: 28169
I have a data frame with a bunch of categorical variables. Some of them contain NAs, and I use the addNA function to convert them to an explicit factor level. My problem is that when I then try to treat them as NAs, they don't seem to register.
Here's my example data set and attempts to 'find' NA's:
df1 <- data.frame(id = 1:200, y =rbinom(200, 1, .5),
var1 = factor(rep(c('abc','def','ghi','jkl'),50)))
df1$var2 <- factor(rep(c('ab c','ghi','jkl','def'),50))
df1$var3 <- factor(rep(c('abc','ghi','nop','xyz'),50))
df1[df1$var1 == 'abc','var1'] <- NA
df1$var1 <- addNA(df1$var1)
df1$isNaCol <- ifelse(df1$var1 == NA, 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(is.na(df1$var1), 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(df1$var1 == 'NA', 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(df1$var1 == '<NA>', 1, 0);summary(df1$isNaCol)
Also, when I type ??addNA I don't get any matches. Is this a gray-market function or something? Any suggestions would be appreciated.
Upvotes: 12
Views: 2595
Reputation: 307
I'm amazed such a simple question doesn't have a simple answer. I ran into the same situation: I needed NA levels for a subset of my data pipeline. It turns out is.na() works on the levels but not on the factor variable itself, so my solution is based on that.
# create a factor variable with two levels and missing values
set.seed(1)
x <- factor(sample(c(0,1,NA), size = 10, replace = TRUE))
x
#[1] 0 <NA> 0 1 0 <NA> <NA> 1 1 <NA>
#Levels: 0 1
# is.na works...
is.na(x)
#[1] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
# add NA as a level
x <- addNA(x)
x
#[1] 0 <NA> 0 1 0 <NA> <NA> 1 1 <NA>
#Levels: 0 1 <NA>
# is.na doesn't work...
is.na(x)
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# get the level that is NA
na_level <- which(is.na(levels(x))) # 3
# Same as if using is.na() before using addNA()
!x %in% (levels(x)[-na_level])
# [1] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
Applying this directly to your problem:
na_level <- which(is.na(levels(df1$var1)))
# flag rows whose value is not one of the non-NA levels, i.e. rows at the NA level
df1$isNaCol <- ifelse(!(df1$var1 %in% levels(df1$var1)[-na_level]), 1, 0)
summary(df1$isNaCol)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.00 0.00 0.00 0.25 0.25 1.00
table(df1$isNaCol)
# 0 1
# 150 50
Upvotes: 0
Reputation: 174908
Note that this is done with the OP's data before the call to addNA().
It is instructive to see what addNA() does with this data.
> head(df1$var1)
[1] <NA> def ghi jkl <NA> def
Levels: abc def ghi jkl
> levels(df1$var1)
[1] "abc" "def" "ghi" "jkl"
> head(addNA(df1$var1))
[1] <NA> def ghi jkl <NA> def
Levels: abc def ghi jkl <NA>
> levels(addNA(df1$var1))
[1] "abc" "def" "ghi" "jkl" NA
addNA alters the levels of the factor so that missing-ness (NA) becomes a level. By default R leaves it out, because which level the NA values take is, of course, unknown. It also strips out the NA information: in a sense the value is no longer unknown but part of a category "missing".
To look at the help for addNA, use ?addNA.
If we look at the definition of addNA, we see that all it does is alter the levels of the factor; it does not change the data:
> addNA
function (x, ifany = FALSE)
{
    if (!is.factor(x))
        x <- factor(x)
    if (ifany & !any(is.na(x)))
        return(x)
    ll <- levels(x)
    if (!any(is.na(ll)))
        ll <- c(ll, NA)
    factor(x, levels = ll, exclude = NULL)
}
Note that it doesn't otherwise change the data - the NAs are still there in the factor. We can replicate most of the behaviour of addNA via:
with(df1, factor(var1, levels = c(levels(var1), NA), exclude = NULL))
> head(with(df1, factor(var1, levels = c(levels(var1), NA), exclude = NULL)))
[1] <NA> def ghi jkl <NA> def
Levels: abc def ghi jkl <NA>
However, because NA is now a level, those entries are not indicated as being missing via is.na(). That explains why the second comparison you tried (the one using is.na()) does not work.
The only nicety you get from addNA is that it doesn't add NA as a level if it already exists as one. Also, via the ifany argument you can stop it adding NA as a level if there are no NAs in the data.
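As a rough illustration of those two niceties, here is a minimal sketch on made-up toy factors (f_with_na and f_without_na are hypothetical names, not the OP's data):

```r
# Hypothetical toy factors, not the data from the question
f_with_na    <- factor(c("a", "b", NA))
f_without_na <- factor(c("a", "b", "a"))

lv_added    <- levels(addNA(f_with_na))                   # NA appended as a level
lv_forced   <- levels(addNA(f_without_na))                # NA appended even though no value is NA
lv_ifany    <- levels(addNA(f_without_na, ifany = TRUE))  # levels left unchanged
lv_repeated <- levels(addNA(addNA(f_with_na)))            # NA level is not duplicated

lv_added
lv_forced
lv_ifany
lv_repeated
```

So ifany = TRUE only matters when the factor has no missing values to begin with.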
Where you are going wrong is in attempting to compare an NA with something using the usual comparison methods (except in your second example). If we don't know what value an NA observation takes, how can we compare it with something? Well, we can't, other than with the internal representation of NA. This is what the is.na() function does:
> with(df1, head(is.na(var1), 10))
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
Hence I would do (without using addNA at all):
df1 <- transform(df1, isNaCol = is.na(var1))
> head(df1)
id y var1 var2 var3 isNaCol
1 1 1 <NA> ab c abc TRUE
2 2 0 def ghi ghi FALSE
3 3 0 ghi jkl nop FALSE
4 4 0 jkl def xyz FALSE
5 5 0 <NA> ab c abc TRUE
6 6 1 def ghi ghi FALSE
If you want that as a 1, 0 variable, just add as.numeric(), as in:
df1 <- transform(df1, isNaCol = as.numeric(is.na(var1)))
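To make the contrast between == and is.na() concrete, a small sketch on a toy factor (f is a hypothetical example, built without addNA):

```r
# Hypothetical toy factor with two missing values; addNA() is NOT applied
f <- factor(c("a", NA, "b", NA))

cmp  <- f == NA              # comparison with NA: every element is NA
miss <- is.na(f)             # FALSE TRUE FALSE TRUE
flag <- as.numeric(miss)     # 0 1 0 1

cmp
miss
flag
```

This is why ifelse(df1$var1 == NA, 1, 0) can only ever return NA, whatever the data look like.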
Where I think you are really going wrong is in wanting to attach an NA level to the factor. I see addNA() as a convenience function for use in things like table(), and even that has arguments that make the prior use of addNA() unnecessary, e.g.:
> with(df1, table(var1, useNA = "ifany"))
var1
abc def ghi jkl <NA>
0 50 50 50 50
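For comparison, a minimal sketch on a toy factor (f is hypothetical) showing that useNA and addNA() lead to the same counts in table():

```r
# Hypothetical toy factor: two "a", one "b", one missing
f <- factor(c("a", NA, "b", "a"))

t_default <- table(f)                   # NA values are dropped by default
t_useNA   <- table(f, useNA = "ifany")  # adds an <NA> column when NAs are present
t_addNA   <- table(addNA(f))            # NA is a level, so it is counted

t_default
t_useNA
t_addNA
```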
Upvotes: 5
Reputation: 44614
Testing equality to NA with the usual comparison operators always yields NA; you want is.na. Additionally, calling is.na on a factor tests each level index (not the value associated with that index), so you want to convert the factor to a character vector first.
df1$isNaCol <- ifelse(is.na(as.character(df1$var1)), 1, 0);summary(df1$isNaCol)
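A quick sketch of why the conversion helps, using a toy factor (f is hypothetical) that has already been through addNA(). The last line is an alternative I believe is equivalent, looking the answer up via the levels instead of converting:

```r
# Hypothetical toy factor with NA turned into an explicit level
f <- addNA(factor(c("a", NA, "b", NA)))

on_factor <- is.na(f)                 # all FALSE: the level codes are not missing
on_char   <- is.na(as.character(f))   # FALSE TRUE FALSE TRUE
via_level <- is.na(levels(f))[f]      # same result, indexing by level code

on_factor
on_char
via_level
```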
Upvotes: 5
Reputation: 57696
Anything compared to NA is NA; this is why your first summary is all NA.
The addNA function changes any NA observations in your factor to a new level. This level is then given the label NA (of character mode). The underlying variable itself no longer has any NAs. This is why your second summary is all 0.
To see how many observations have the NA level, use what Matthew Plourde posted.
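One way to see this for yourself is to look at the integer codes behind the factor; a small sketch on a toy factor (f is hypothetical):

```r
# Hypothetical toy factor with one missing value
f <- factor(c("a", NA, "b"))

codes_before <- as.integer(f)          # 1 NA 2: the code itself is missing
codes_after  <- as.integer(addNA(f))   # 1 3 2: the NA level gets an ordinary code
any_missing  <- any(is.na(addNA(f)))   # FALSE

codes_before
codes_after
any_missing
```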
Upvotes: 4