Y.Z.

Reputation: 13

Factor variable coerced into character when replacing NA with a number using apply()

I noticed a peculiar behavior of apply() when trying to replace NA with the number 9 in multiple factor variables. I had already defined the levels and labels of those variables. When I use ifelse() on each variable individually (e.g. ifelse(is.na(x),9,x)), it coerces the variable to its integer codes, which is understandable. However, when I wrote a function to do exactly the same thing and used apply() over multiple columns, it coerced all variables to character. Adding one more step in the function to convert them back to factor doesn't help. Did I miss something, or is there something strange about apply()? Thanks!

a<-c(1,2,3,NA,2)
b<-c(2,1,2,2,NA)
a<-factor(a,levels=c(1,2,3),labels=c("First","Second","Third"))
b<-factor(b,levels=c(1,2,3), labels=c("AA","BB","CC"))
dat<-cbind(a,b)
replace.na<-function(x){
    x<-as.factor(ifelse(is.na(x),9,x))
}
a<-ifelse(is.na(a),9,a)
str(a)
dat<-apply(dat,2,replace.na)
str(dat)

I would expect apply() to return variables of the same type, or at least that using as.factor() inside the function would coerce each variable to a factor.

Upvotes: 1

Views: 526

Answers (1)

IRTFM

Reputation: 263301

A major difficulty in dealing with factors is that they cannot accept an assignment of a value that is not among the existing levels; such an assignment produces NA with a warning. Your example doesn't run into that, because you used cbind, which coerces the factors to their underlying integer codes. Factors are really integer vectors with a levels attribute. If you want a structure that will accept assignments outside the existing levels, you have two options: 1) convert the factors with as.character, or 2) first augment the factor levels with levels(fac) <- c(levels(fac), new_values).
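
The points above can be checked with a small sketch. It reuses the factor definitions from the question; the names m, bad and fixed are only illustrative, and "9" as the new level is an assumption taken from the question's replacement value.

```r
# Two small factors shaped like the ones in the question
a <- factor(c(1, 2, 3, NA, 2), levels = 1:3, labels = c("First", "Second", "Third"))
b <- factor(c(2, 1, 2, 2, NA), levels = 1:3, labels = c("AA", "BB", "CC"))

# cbind() drops the factor attributes and keeps only the integer codes
m <- cbind(a, b)
typeof(m)

# Assigning a value that is not an existing level yields NA (with a warning,
# suppressed here so the script runs quietly)
bad <- a
suppressWarnings(bad[4] <- "9")
is.na(bad[4])

# Option 2: extend the levels first, then assign the new value
fixed <- a
levels(fixed) <- c(levels(fixed), "9")
fixed[is.na(fixed)] <- "9"
levels(fixed)
```

Option 2 keeps the object a factor, which is why it is the one to reach for when the result must not become character.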

Since you want to work on multiple columns in a matrix, I think it would be better to use the first option of converting to character before using cbind.

 a<-c(1,2,3,NA,2)
 b<-c(2,1,2,2,NA)
 a<-factor(a,levels=c(1,2,3),labels=c("First","Second","Third"))
 b<-factor(b,levels=c(1,2,3), labels=c("AA","BB","CC"))
 dat<-cbind( as.character(a), as.character(b))
 replace.na<-function(x){
     x<-as.factor(ifelse(is.na(x), 9, x))
 }
 a<-ifelse(is.na(a),9,a)
 str(a)
num [1:5] 1 2 3 9 2    #shows the underlying numeric values after changing `a`
 dat<-apply(dat,2,replace.na)
 str(dat)             # the dat object was not affected by the second modification of `a`
chr [1:5, 1:2] "First" "Second" "Third" "9" "Second" "BB" "AA" "BB" "BB" ...
dat
#---------------
     [,1]     [,2]
[1,] "First"  "BB"
[2,] "Second" "AA"
[3,] "Third"  "BB"
[4,] "9"      "BB"
[5,] "Second" "9" 
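
If the columns need to stay factors rather than become character, one possible alternative (a sketch, not the approach shown above) is to keep them in a data frame and use lapply() with the second option. The documentation for apply() notes that its result is coerced to a basic vector type, so factor results come back as a character array, which is why as.factor() inside the function has no lasting effect there. The helper name add_level_for_na and its new_level argument are hypothetical.

```r
a <- factor(c(1, 2, 3, NA, 2), levels = 1:3, labels = c("First", "Second", "Third"))
b <- factor(c(2, 1, 2, 2, NA), levels = 1:3, labels = c("AA", "BB", "CC"))

# A data frame keeps each column as its own factor; a matrix cannot
dat <- data.frame(a, b)

# Hypothetical helper: add a new level, then use it for the missing values
add_level_for_na <- function(x, new_level = "9") {
  levels(x) <- c(levels(x), new_level)
  x[is.na(x)] <- new_level
  x
}

# Assigning into dat[] keeps the data frame structure;
# lapply() hands back each column unchanged in type
dat[] <- lapply(dat, add_level_for_na)
str(dat)
```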

Upvotes: 0
