Reputation: 123
So I am learning R now, and I have noticed in many sources that whenever we want to convert a data type to numeric, we use two functions: as.numeric() and as.factor().
For example if I want to convert column Year to numeric, it will be
as.numeric(as.factor(survey.data$Year))
I tried using as.numeric() alone and it works perfectly as well. But I feel there is something I am missing here. What is the reason for converting to factor first, and then to numeric?
Thanks.
Upvotes: 2
Views: 23823
Reputation: 335
If you want to convert column Year to numeric, perhaps what you are thinking of is the process needed to convert a factor into a numeric, where you have to convert into a character first.
You can do the same with a dataframe and subset the column, but here I am creating a simple example:
#Create a factor vector called Year with 3 levels
Year <- factor(c(2001, 2001, 2001, 2004, 2004, 2020, 2020))
Year
[1] 2001 2001 2001 2004 2004 2020 2020
Levels: 2001 2004 2020
If you try to go straight from a factor into a numeric, you get a numeric vector, but instead of your original values you see which level each of your values matches. For example, the first level, 2001, matches the first three values in Year, so you see 1 1 1 as the first three values in your numeric vector:
#Incorrect: convert Year into numeric directly
nope.Year <- as.numeric(Year)
nope.Year
[1] 1 1 1 2 2 3 3
To correctly convert a factor into a numeric and get back your original vector values, first convert into a character and then into a numeric. You can do this with nested function calls, since R evaluates the innermost call first and works outward:
#Correct: convert Year into a character, then into numeric
num.Year <- as.numeric(as.character(Year))
num.Year
[1] 2001 2001 2001 2004 2004 2020 2020
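As a side note, the help page for factor (?factor) also mentions an equivalent idiom that converts the level labels once and then indexes them by the factor. A minimal sketch, reusing the Year vector from above (the name alt.Year is only for illustration):

```r
# Recreate the factor from the example above
Year <- factor(c(2001, 2001, 2001, 2004, 2004, 2020, 2020))

# Convert the level labels to numeric once, then index them by the factor
alt.Year <- as.numeric(levels(Year))[Year]

# Check that this matches the as.character() route
identical(alt.Year, as.numeric(as.character(Year)))
```

Both routes should give the original values back; the indexing form only has to parse each distinct level once, which may matter for long vectors with few levels.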
Of note, if you are using an older version of R (prior to 4.0), then data.frame() and the read.table() variants convert your character strings into factors by default, unless you specify the argument stringsAsFactors = FALSE in each of these functions. If you did not specify stringsAsFactors = FALSE, then you would have to go through this process of converting your factor into a character and then into numeric.
If you are using a later version of R (4.0 or higher), R no longer automatically converts character strings into factors when you use those functions or their variants, because the default is now stringsAsFactors = FALSE, and we can all celebrate.
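To see the difference for yourself, here is a small sketch that sets stringsAsFactors explicitly, so it should behave the same on any R version. The data frame names df.fct and df.chr are made up for this example:

```r
# Pass stringsAsFactors explicitly so the result does not depend on the R version
df.fct <- data.frame(Year = c("2001", "2004", "2020"), stringsAsFactors = TRUE)
df.chr <- data.frame(Year = c("2001", "2004", "2020"), stringsAsFactors = FALSE)

class(df.fct$Year)  # "factor"
class(df.chr$Year)  # "character"

# A character column converts directly
as.numeric(df.chr$Year)

# A factor column needs the as.character() step first
as.numeric(as.character(df.fct$Year))
```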
Upvotes: 2
Reputation: 1162
I don't think you're missing anything here. The main thing to understand is how R converts data types, three common ones being numeric, character and factor. Factors are by far (in my opinion) the least intuitive if you have come from other languages.
I like to think of factors as 'categories'. They have no order (unlike characters, which can be ordered alphabetically). They're an abstract data type for listing stuff. Others might disagree with that explanation, but it is what helped me understand.
I said factors have no order; well, that was kind of a lie for simplicity. As it turns out, factors also have levels. Levels list the order of things. Say we have a vector:
animals <- factor(c("Rabbit", "Cat", "Dog"))
If we check its levels using levels(animals), it will return "Cat" "Dog" "Rabbit" in that order. This is because we created the vector as characters, so the default 'level order' is alphabetical.
We can change these level orders in ways I won't go into here, but if you wanted Rabbit to be the first level, you would need to set that manually. This means you can create order to these abstract variables.
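One way to set the order manually (there are others) is the levels argument of factor(). A quick sketch; the name animals2 is just for illustration:

```r
# Default: levels are sorted alphabetically
animals <- factor(c("Rabbit", "Cat", "Dog"))
levels(animals)   # "Cat" "Dog" "Rabbit"

# Explicit order via the levels argument, with Rabbit first
animals2 <- factor(c("Rabbit", "Cat", "Dog"),
                   levels = c("Rabbit", "Cat", "Dog"))
levels(animals2)  # "Rabbit" "Cat" "Dog"
```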
If we used as.numeric(animals), it would result in c(3, 1, 2). That's because as.numeric() converts each factor value to a number indicating the position of its level.
If you want to convert a factor, say "1", to the number 1, you would have to first convert it to a character, then to a number.
This is because converting a factor directly to a number gives you the level position, as above. Converting a factor to a character instead gives you the label of each value's level as a string. Converting that character to numeric then turns the number characters into actual numbers.
So to get back to your example, I think just using as.numeric() is fine as long as the column is not already a factor, UNLESS you want to get the numbers which represent the order of the factor levels.
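A small sketch of what that difference looks like. The vector below is a hypothetical stand-in for survey.data$Year, assumed here to be stored as character:

```r
# Hypothetical stand-in for survey.data$Year, stored as character
Year <- c("2004", "2001", "2020")

# Direct conversion keeps the actual values
as.numeric(Year)             # 2004 2001 2020

# Going through a factor returns positions in the sorted levels instead
as.numeric(as.factor(Year))  # 2 1 3
```

So as.numeric(as.factor(x)) is mainly useful when you want an integer code per category rather than the values themselves.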
Upvotes: 0