Reputation: 10001
When I convert a factor to a numeric or integer, I get the underlying level codes, not the values as numbers.
f <- factor(sample(runif(5), 20, replace = TRUE))
## [1] 0.0248644019011408 0.0248644019011408 0.179684827337041
## [4] 0.0284090070053935 0.363644931698218 0.363644931698218
## [7] 0.179684827337041 0.249704354675487 0.249704354675487
## [10] 0.0248644019011408 0.249704354675487 0.0284090070053935
## [13] 0.179684827337041 0.0248644019011408 0.179684827337041
## [16] 0.363644931698218 0.249704354675487 0.363644931698218
## [19] 0.179684827337041 0.0284090070053935
## 5 Levels: 0.0248644019011408 0.0284090070053935 ... 0.363644931698218
as.numeric(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
as.integer(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
I have to resort to paste
to get the real values:
as.numeric(paste(f))
## [1] 0.02486440 0.02486440 0.17968483 0.02840901 0.36364493 0.36364493
## [7] 0.17968483 0.24970435 0.24970435 0.02486440 0.24970435 0.02840901
## [13] 0.17968483 0.02486440 0.17968483 0.36364493 0.24970435 0.36364493
## [19] 0.17968483 0.02840901
Is there a better way to convert a factor to numeric?
Upvotes: 715
Views: 1153826
Reputation: 52319
The collapse
package includes a wrapper around as.numeric(levels(f))[f]
and as.character(levels(f))[f]
in as_numeric_factor
and as_character_factor
.
library(collapse)
set.seed(1)
f <- factor(sample(runif(5), 5, replace = TRUE))
as_numeric_factor(f)
# [1] 0.2016819 0.5728534 0.3721239 0.5728534 0.5728534
as_character_factor(f)
# [1] "0.201681931037456" "0.572853363351896" "0.37212389963679" "0.572853363351896" "0.572853363351896"
It gives similar performances compared to as.numeric(levels(f))[f]
.
# Unit: milliseconds
# expr min lq mean median uq max neval
# as.numeric(levels(f))[f] 2.6026 3.01305 5.834900 3.54310 8.57450 66.3497 100
# as.numeric(levels(f)[f]) 317.2509 336.78690 350.215388 349.85620 361.57980 401.1002 100
# as_numeric_factor(f) 2.5793 2.92970 5.383223 3.23355 4.29355 68.4460 100
Code:
set.seed(1)
f <- factor(sample(runif(5), 1e6, replace = TRUE))
library(microbenchmark)
microbenchmark(
as.numeric(levels(f))[f],
as.numeric(levels(f)[f]),
as_numeric_factor(f),
times = 100
)
Upvotes: 0
Reputation: 21
I found as.numeric(levels(f))[f]
difficult to apply across a list of column names using tidyverse syntax. Converting to a character first then an integer gave me the original numeric values without having to add additional packages. Perhaps not the most performant/elegant solution but kept things simple and readable.
library(tidyverse)
tbl_df <- tibble(a = as.factor(c("7", "3")),
b = as.factor(c("1.5", "6.3")))
cols <- c("a", "b")
tbl_df %>%
mutate(across(all_of(cols), as.character)) %>%
mutate(across(all_of(cols), as.numeric))
Upvotes: 1
Reputation: 47
If you have many factor
columns to convert to numeric
,
df <- rapply(df, function(x) as.numeric(levels(x))[x], "factor", how = "replace")
This solution is robust for data.frames
containing mixed types, provided all factor levels are numbers.
Upvotes: 1
Reputation: 6277
R has a number of (undocumented) convenience functions for converting factors:
as.character.factor
as.data.frame.factor
as.Date.factor
as.list.factor
as.vector.factor
But annoyingly, there is nothing to handle the factor -> numeric conversion. As an extension of Joshua Ulrich's answer, I would suggest to overcome this omission with the definition of your own idiomatic function:
as.double.factor <- function(x) {as.numeric(levels(x))[x]}
that you can store at the beginning of your script, or even better in your .Rprofile
file.
Upvotes: 111
Reputation: 3829
The most easiest way would be to use unfactor
function from package varhandle which can accept a factor vector or even a dataframe:
unfactor(your_factor_variable)
This example can be a quick start:
x <- rep(c("a", "b", "c"), 20)
y <- rep(c(1, 1, 0), 20)
class(x) # -> "character"
class(y) # -> "numeric"
x <- factor(x)
y <- factor(y)
class(x) # -> "factor"
class(y) # -> "factor"
library(varhandle)
x <- unfactor(x)
y <- unfactor(y)
class(x) # -> "character"
class(y) # -> "numeric"
You can also use it on a dataframe. For example the iris
dataset:
sapply(iris, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species "numeric" "numeric" "numeric" "numeric" "factor"
# load the package
library("varhandle")
# pass the iris to unfactor
tmp_iris <- unfactor(iris)
# check the classes of the columns
sapply(tmp_iris, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species "numeric" "numeric" "numeric" "numeric" "character"
# check if the last column is correctly converted
tmp_iris$Species
[1] "setosa" "setosa" "setosa" "setosa" "setosa" [6] "setosa" "setosa" "setosa" "setosa" "setosa" [11] "setosa" "setosa" "setosa" "setosa" "setosa" [16] "setosa" "setosa" "setosa" "setosa" "setosa" [21] "setosa" "setosa" "setosa" "setosa" "setosa" [26] "setosa" "setosa" "setosa" "setosa" "setosa" [31] "setosa" "setosa" "setosa" "setosa" "setosa" [36] "setosa" "setosa" "setosa" "setosa" "setosa" [41] "setosa" "setosa" "setosa" "setosa" "setosa" [46] "setosa" "setosa" "setosa" "setosa" "setosa" [51] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [56] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [61] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [66] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [71] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [76] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [81] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [86] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [91] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [96] "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" [101] "virginica" "virginica" "virginica" "virginica" "virginica" [106] "virginica" "virginica" "virginica" "virginica" "virginica" [111] "virginica" "virginica" "virginica" "virginica" "virginica" [116] "virginica" "virginica" "virginica" "virginica" "virginica" [121] "virginica" "virginica" "virginica" "virginica" "virginica" [126] "virginica" "virginica" "virginica" "virginica" "virginica" [131] "virginica" "virginica" "virginica" "virginica" "virginica" [136] "virginica" "virginica" "virginica" "virginica" "virginica" [141] "virginica" "virginica" "virginica" "virginica" "virginica" [146] "virginica" "virginica" "virginica" "virginica" "virginica"
Upvotes: 43
Reputation: 34586
type.convert(f)
on a factor whose levels are completely numeric is another base option.
Performance-wise it's about equivalent to as.numeric(as.character(f))
but not nearly as quick as as.numeric(levels(f))[f]
.
identical(type.convert(f), as.numeric(levels(f))[f])
[1] TRUE
That said, if the reason the vector was created as a factor in the first instance has not been addressed (i.e. it likely contained some characters that could not be coerced to numeric) then this approach won't work and it will return a factor.
levels(f)[1] <- "some character level"
identical(type.convert(f), as.numeric(levels(f))[f])
[1] FALSE
Upvotes: 3
Reputation: 7
Looks like the solution as.numeric(levels(f))[f] no longer work with R 4.0.
Alternative solution:
factor2number <- function(x){
data.frame(levels(x), 1:length(levels(x)), row.names = 1)[x, 1]
}
factor2number(yourFactor)
Upvotes: -2
Reputation: 1742
From the many answers I could read, the only given way was to expand the number of variables according to the number of factors. If you have a variable "pet" with levels "dog" and "cat", you would end up with pet_dog and pet_cat.
In my case I wanted to stay with the same number of variables, by just translating the factor variable to a numeric one, in a way that can applied to many variables with many levels, so that cat=1 and dog=0 for instance.
Please find the corresponding solution below:
crime <- data.frame(city = c("SF", "SF", "NYC"),
year = c(1990, 2000, 1990),
crime = 1:3)
indx <- sapply(crime, is.factor)
crime[indx] <- lapply(crime[indx], function(x){
listOri <- unique(x)
listMod <- seq_along(listOri)
res <- factor(x, levels=listOri)
res <- as.numeric(res)
return(res)
}
)
Upvotes: -1
Reputation: 1449
Note: this particular answer is not for converting numeric-valued factors to numerics, it is for converting categorical factors to their corresponding level numbers.
Every answer in this post failed to generate results for me , NAs were getting generated.
y2<-factor(c("A","B","C","D","A"));
as.numeric(levels(y2))[y2]
[1] NA NA NA NA NA Warning message: NAs introduced by coercion
What worked for me is this -
as.integer(y2)
# [1] 1 2 3 4 1
Upvotes: 47
Reputation: 1690
late to the game, accidently, I found trimws()
can convert factor(3:5)
to c("3","4","5")
. Then you can call as.numeric()
. That is:
as.numeric(trimws(x_factor_var))
Upvotes: 4
Reputation: 1960
You can use hablar::convert
if you have a data frame. The syntax is easy:
Sample df
library(hablar)
library(dplyr)
df <- dplyr::tibble(a = as.factor(c("7", "3")),
b = as.factor(c("1.5", "6.3")))
Solution
df %>%
convert(num(a, b))
gives you:
# A tibble: 2 x 2
a b
<dbl> <dbl>
1 7. 1.50
2 3. 6.30
Or if you want one column to be integer and one numeric:
df %>%
convert(int(a),
num(b))
results in:
# A tibble: 2 x 2
a b
<int> <dbl>
1 7 1.50
2 3 6.30
Upvotes: 5
Reputation: 176698
See the Warning section of ?factor
:
In particular,
as.numeric
applied to a factor is meaningless, and may happen by implicit coercion. To transform a factorf
to approximately its original numeric values,as.numeric(levels(f))[f]
is recommended and slightly more efficient thanas.numeric(as.character(f))
.
The FAQ on R has similar advice.
Why is as.numeric(levels(f))[f]
more efficent than as.numeric(as.character(f))
?
as.numeric(as.character(f))
is effectively as.numeric(levels(f)[f])
, so you are performing the conversion to numeric on length(x)
values, rather than on nlevels(x)
values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won't be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don't worry too much about it.
Some timings
library(microbenchmark)
microbenchmark(
as.numeric(levels(f))[f],
as.numeric(levels(f)[f]),
as.numeric(as.character(f)),
paste0(x),
paste(x),
times = 1e5
)
## Unit: microseconds
## expr min lq mean median uq max neval
## as.numeric(levels(f))[f] 3.982 5.120 6.088624 5.405 5.974 1981.418 1e+05
## as.numeric(levels(f)[f]) 5.973 7.111 8.352032 7.396 8.250 4256.380 1e+05
## as.numeric(as.character(f)) 6.827 8.249 9.628264 8.534 9.671 1983.694 1e+05
## paste0(x) 7.964 9.387 11.026351 9.956 10.810 2911.257 1e+05
## paste(x) 7.965 9.387 11.127308 9.956 11.093 2419.458 1e+05
Upvotes: 856
Reputation: 5536
It is possible only in the case when the factor labels match the original values. I will explain it with an example.
Assume the data is vector x
:
x <- c(20, 10, 30, 20, 10, 40, 10, 40)
Now I will create a factor with four labels:
f <- factor(x, levels = c(10, 20, 30, 40), labels = c("A", "B", "C", "D"))
1) x
is with type double, f
is with type integer. This is the first unavoidable loss of information. Factors are always stored as integers.
> typeof(x)
[1] "double"
> typeof(f)
[1] "integer"
2) It is not possible to revert back to the original values (10, 20, 30, 40) having only f
available. We can see that f
holds only integer values 1, 2, 3, 4 and two attributes - the list of labels ("A", "B", "C", "D") and the class attribute "factor". Nothing more.
> str(f)
Factor w/ 4 levels "A","B","C","D": 2 1 3 2 1 4 1 4
> attributes(f)
$levels
[1] "A" "B" "C" "D"
$class
[1] "factor"
To revert back to the original values we have to know the values of levels used in creating the factor. In this case c(10, 20, 30, 40)
. If we know the original levels (in correct order), we can revert back to the original values.
> orig_levels <- c(10, 20, 30, 40)
> x1 <- orig_levels[f]
> all.equal(x, x1)
[1] TRUE
And this will work only in case when labels have been defined for all possible values in the original data.
So if you will need the original values, you have to keep them. Otherwise there is a high chance it will not be possible to get back to them only from a factor.
Upvotes: 13