Reputation: 2638
I have a problem with the way R coerces variable types when using rbind on two data.frames with NA values. I illustrate by example:
x<-factor(sample(1:3,10,TRUE))
y<-rnorm(10)
dat<-data.frame(x,y)
NAs<-data.frame(matrix(NA,ncol=ncol(dat),nrow=nrow(dat)))
colnames(NAs)<-colnames(dat)
Now the goal is to append dat and NAs while keeping the variable types factor and numeric of x and y. When I give:
dat_forward<-rbind(dat,NAs)
is.factor(dat_forward$x)
this works fine. However, the backward direction using rbind fails:
dat_backward<-rbind(NAs,dat)
is.factor(dat_backward$x)
is.character(dat_backward$x)
Now x is coerced to character. I am confused: can't it stay a factor even if I bind in the other order? What would be a straightforward change to my code to reach my goal?
Upvotes: 9
Views: 3479
Reputation: 44614
One approach would be to create NAs with the correct column data types. This can be easily done with
NAs <- dat[NA,]
You can also make as many rows as desired with
num.rows <- 30
NAs <- dat[NA,][1:num.rows,]
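As a variant of the indexing trick above (not part of the original answer), the requested number of rows can be produced in one step by indexing with a vector of integer NAs. This is a minimal sketch; the set.seed call and the row-name reset are additions made for illustration only.

```r
set.seed(1)  # only to make the sketch reproducible
dat <- data.frame(x = factor(sample(1:3, 10, TRUE)), y = rnorm(10))

num.rows <- 30
# Each NA index yields one all-NA row that keeps the column types of dat
NAs <- dat[rep(NA_integer_, num.rows), ]
rownames(NAs) <- NULL  # drop the "NA", "NA.1", ... row names

# Binding in the "backward" order now keeps x as a factor
dat_backward <- rbind(NAs, dat)
str(dat_backward)
```

Because NAs inherits its columns from dat, the factor levels of x come along too, so either order of binding gives the same column classes.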
Upvotes: 0
Reputation: 162321
Here's a fairly simple way to get the column classes right:
x <- rbind(dat[1,], NAs, dat)[-1,]
str(x)
# $ x: Factor w/ 3 levels "1","2","3": NA NA NA NA NA NA NA NA NA NA ...
# $ y: num NA NA NA NA NA NA NA NA NA NA ...
More generally, if you really need this often, you could create an rbind-like function that takes an additional argument indicating the data.frame to whose column classes you'd like to coerce all of the others' columns:
myrbind <- function(x, ..., template=x) {
  # Prepend one row of the template so rbind takes its column classes, then drop that row
  do.call(rbind, c(list(template[1,]), list(x), list(...)))[-1,]
}
str(myrbind(NAs, dat, template=dat))
# 'data.frame': 20 obs. of 2 variables:
# $ x: Factor w/ 3 levels "1","2","3": NA NA NA NA NA NA NA NA NA NA ...
# $ y: num NA NA NA NA NA NA NA NA NA NA ...
## If no 'template' argument is supplied, myrbind acts just like rbind
str(myrbind(dat, NAs))
# 'data.frame': 20 obs. of 2 variables:
# $ x: Factor w/ 3 levels "1","2","3": 3 3 3 3 2 3 1 1 3 2 ...
# $ y: num 0.303 1.77 -1.38 1.731 0.033 ...
Upvotes: 9
Reputation: 49448
data.frame does a lot of things incorrectly when rbind'ing different types together, especially when that involves factors. Start using data.table (1.8.11+) instead and you won't have these issues:
library(data.table)
dt1 = data.table(dat)
dt2 = data.table(NAs)
sapply(rbind(dt1, dt2), class)
# x y
# "factor" "numeric"
sapply(rbind(dt2, dt1), class)
# x y
# "factor" "numeric"
Upvotes: 3
Reputation: 18323
Similarly, you could just convert the column in NAs to factor:
NAs$x<-factor(NAs$x)
dat_backward<-rbind(NAs,dat)
is.factor(dat_backward$x) # TRUE
is.character(dat_backward$x) # FALSE
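If dat had several factor columns, the same conversion could be applied to each of them in a loop. The following is a rough sketch under that assumption; the variable factor_cols and the set.seed call are introduced here for illustration and are not from the original answer.

```r
set.seed(1)  # only to make the sketch reproducible
dat <- data.frame(x = factor(sample(1:3, 10, TRUE)), y = rnorm(10))
NAs <- data.frame(matrix(NA, ncol = ncol(dat), nrow = nrow(dat)))
colnames(NAs) <- colnames(dat)

# Names of the columns that are factors in dat (hypothetical helper variable)
factor_cols <- names(dat)[sapply(dat, is.factor)]

# Turn the matching all-NA columns into factors with the same levels as in dat
for (nm in factor_cols) {
  NAs[[nm]] <- factor(NAs[[nm]], levels = levels(dat[[nm]]))
}

dat_backward <- rbind(NAs, dat)
sapply(dat_backward, class)
```

Passing the levels explicitly is optional here, since rbind expands factor levels as needed, but it keeps NAs usable on its own as a typed placeholder.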
Upvotes: 3
Reputation: 44320
From ?rbind.data.frame, we read: "It then takes the classes of the columns from the first data frame...". This is why the order matters in your call to rbind.
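To see what rbind has to work with in each order, it can help to inspect the column classes of both inputs. This is a small sketch that rebuilds the question's data (with a set.seed call added for reproducibility): a data.frame made from matrix(NA, ...) has logical columns, because a bare NA is logical.

```r
set.seed(42)  # only to make the sketch reproducible
dat <- data.frame(x = factor(sample(1:3, 10, TRUE)), y = rnorm(10))
NAs <- data.frame(matrix(NA, ncol = ncol(dat), nrow = nrow(dat)))
colnames(NAs) <- colnames(dat)

# Column classes of the two inputs
sapply(dat, class)
sapply(NAs, class)

# Class of x when dat is the first argument, and when NAs is
class(rbind(dat, NAs)$x)
class(rbind(NAs, dat)$x)
```

When dat comes first, rbind starts from a factor column; when NAs comes first, it starts from a logical one, which is the situation described in the question.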
To get the variable classes of dat_forward with the ordering of dat_backward, you could just construct dat_forward and reorder the rows:
dat_new = rbind(dat, NAs)[c((nrow(dat)+1):(nrow(dat)+nrow(NAs)), 1:nrow(dat)),]
str(dat_new)
# 'data.frame': 20 obs. of 2 variables:
# $ x: Factor w/ 3 levels "1","2","3": NA NA NA NA NA NA NA NA NA NA ...
# $ y: num NA NA NA NA NA NA NA NA NA NA ...
Upvotes: 2