Reputation: 783
Problem
Working with a data frame in R, I want to change variables represented as characters into variables represented as numbers (i.e. from class chr to num).
For an entire data set, this is a straightforward problem (different flavors of solutions here, here, here, and here). However, I have one variable that needs to stay as characters.
Example Data
Using this example data (df), let's say I want to change only var1 from class chr to num, leaving "chrOK" as a chr variable. In my real data set, there are many variables to change, so manual approaches like df$var1 = as.numeric(df$var1) are too laborious.
df = data.frame(var1 = c("1","2","3","4"),
var2 = c(1,2,3,4),
chrOK = c("rick", "summer","beth", "morty"),
stringsAsFactors = FALSE)
str(df)
'data.frame': 4 obs. of 3 variables:
$ var1 : chr "1" "2" "3" "4"
$ var2 : num 1 2 3 4
$ chrOK: chr "rick" "summer" "beth" "morty"
Partial Solutions
I've tried several approaches that seem close, but don't do exactly what I want.
Attempt 1 — introduces NAs
Most of my columns are characters that should be numeric, like "var1". So, using apply() to convert the class works for those columns. However, this approach induces NA values in "chrOK".
df = as.data.frame(apply(df, 2, function(x) as.numeric(x)))
Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
str(df)
'data.frame': 4 obs. of 3 variables:
$ var1 : num 1 2 3 4
$ var2 : num 1 2 3 4
$ chrOK: num NA NA NA NA
Attempt 2 — split, convert, cbind
Using apply() on the subset of chr variables, excluding "chrOK", doesn't induce NAs, but requires using cbind() to re-include "chrOK".
This solution is not ideal because cbind() results are hard to check for data mutations. (Also, "chrOK" is returned as a factor. Using df = cbind(changed,as.character(unchanged)) doesn't work. [a])
changed = as.data.frame(apply(df[-(which(colnames(df)=="chrOK"))],2,function(x) as.numeric(x)))
unchanged = (df$chrOK)
df = cbind(changed,unchanged)
str(df)
'data.frame': 4 obs. of 3 variables:
$ var1 : num 1 2 3 4
$ var2 : num 1 2 3 4
$ unchanged: Factor w/ 4 levels "beth","morty",..: 3 4 1 2 #[a]
Attempt 3 — correct subset, but error when converting
Using setdiff() I get the subset of chr class variables excluding "chrOK".
df[setdiff(names(df[sapply(df,is.character)]),"chrOK")]
var1
1 1
2 2
3 3
4 4
But trying to plug this into an apply function, so that only the subset is changed from chr to num, returns an error (see [b]).
apply(as.numeric(df[setdiff(names(df[sapply(df,is.character)]),"chrOK")]),
2,function(x) as.numeric(x))
Error in apply(as.numeric(df[setdiff(names(df[sapply(df, is.character)]), :
(list) object cannot be coerced to type 'double' #[b]
Questions
Upvotes: 3
Views: 4359
Reputation: 887213
We can use type.convert from base R by looping over the columns of the dataset and assigning the result back to the original object:
df[] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
str(df)
#'data.frame': 4 obs. of 3 variables:
#$ var1 : int 1 2 3 4
#$ var2 : int 1 2 3 4
#$ chrOK: chr "rick" "summer" "beth" "morty"
type.convert calls C code internally, i.e. C_typeconvert.
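Note that type.convert returns int for whole-number strings, as the str() output above shows. If the columns need to be num specifically, one possible sketch is to reuse the column selection from the OP's Attempt 3 and assign the converted columns back in place. The name to_convert is only illustrative, and this assumes every selected column contains strings that as.numeric can parse:

```r
# Example data from the question
df <- data.frame(var1 = c("1", "2", "3", "4"),
                 var2 = c(1, 2, 3, 4),
                 chrOK = c("rick", "summer", "beth", "morty"),
                 stringsAsFactors = FALSE)

# Names of all character columns except the one that must stay character
# (to_convert is a hypothetical helper name, not from the original post)
to_convert <- setdiff(names(df)[sapply(df, is.character)], "chrOK")

# lapply keeps each column as its own vector, so no matrix coercion happens;
# assigning into df[to_convert] leaves the other columns untouched
df[to_convert] <- lapply(df[to_convert], as.numeric)

str(df)
```

Because the assignment targets only the selected columns, there is no need for cbind() to put "chrOK" back afterwards.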
The OP's solutions are getting NAs for the following reasons:
1) apply converts the data.frame to a matrix, and a matrix can hold only a single class. If there is even a single character element in the matrix, the whole matrix is converted to character.
2) Using as.numeric with apply is problematic because 'chrOK' is a character class column. Whenever as.numeric is applied to non-numeric strings, it returns NA.
3) The OP used the same apply in the second method, so the explanation in 1) applies there as well.
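The coercion described in points 1) and 2) can be checked directly. The following is a small sketch using the question's example data; the names m and coerced are only illustrative:

```r
# Example data from the question
df <- data.frame(var1 = c("1", "2", "3", "4"),
                 var2 = c(1, 2, 3, 4),
                 chrOK = c("rick", "summer", "beth", "morty"),
                 stringsAsFactors = FALSE)

# apply() first turns the data.frame into a matrix; with mixed column types
# the whole matrix, including the numeric var2, becomes character
m <- as.matrix(df)
typeof(m)

# as.numeric() on strings that are not numbers gives NA (with a warning,
# suppressed here only to keep the output quiet)
coerced <- suppressWarnings(as.numeric(df$chrOK))
coerced
```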
Upvotes: 2