Efan Du

Reputation: 123

as.factor() in numeric conversion in R

I am learning R, and I have noticed in many sources that whenever we want to convert a data type to numeric, we use two functions: as.numeric() and as.factor().

For example if I want to convert column Year to numeric, it will be

    as.numeric(as.factor(survey.data$Year))

I tried using as.numeric() alone and it works perfectly well. But I feel there is something I am missing here. What is the reason for converting to factor first, and then to numeric?

Thanks.

Upvotes: 2

Views: 23823

Answers (2)

simpson

Reputation: 335

If you want to convert the Year column to numeric, perhaps what you are thinking of is the process needed to convert a factor to numeric, which requires converting to character first.

You can do the same with a dataframe and subset the column, but here I am creating a simple example:

    # Create a factor vector called Year with 3 levels
    Year <- factor(c(2001, 2001, 2001, 2004, 2004, 2020, 2020))
    Year
    [1] 2001 2001 2001 2004 2004 2020 2020
    Levels: 2001 2004 2020 

If you try to go straight from a factor into a numeric, you will see a numeric vector but instead of your original values, you will see which level each of your values matches. For example, the first level 2001 matches the first three values in Year, so you see 1 1 1 as the first three values in your numeric vector:

    # Incorrect: convert Year into numeric directly
    nope.Year <- as.numeric(Year)
    nope.Year
    [1] 1 1 1 2 2 3 3

To recover the original values from a factor, convert it to character first and then to numeric. You can nest the two calls, since R evaluates the innermost call first and works outward:

    # Correct: convert Year into a character, then into numeric
    num.Year <- as.numeric(as.character(Year))
    num.Year
    [1] 2001 2001 2001 2004 2004 2020 2020
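For what it is worth, the help page ?factor mentions an alternative that converts each level once and then indexes the result by the factor, which it describes as slightly more efficient. A minimal sketch, reusing the same Year vector as above:

```r
# Same factor as in the example above
Year <- factor(c(2001, 2001, 2001, 2004, 2004, 2020, 2020))

# Convert each distinct level once, then index by the factor's integer codes
num.Year <- as.numeric(levels(Year))[Year]
num.Year

# Both approaches recover the original values
identical(num.Year, as.numeric(as.character(Year)))
```

For a short vector the difference is negligible, so as.numeric(as.character(Year)) is arguably the more readable choice.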

Of note, if you are using an older version of R (prior to 4.0), then when you use data.frame() and read.table() variants, R converts your character strings into factors by default, unless you specify the argument stringsAsFactors = FALSE in each of these functions. If you did not specify stringsAsFactors = FALSE, then you would have to go through this process of converting your factor into a character and then into numeric.

If you are using R 4.0 or later, R no longer converts character strings into factors automatically when you use those functions or their variants, because the default is now stringsAsFactors = FALSE, and we can all celebrate.
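To see the difference without depending on your R version, you can pass stringsAsFactors explicitly. This is only a sketch; the data frame names old.style and new.style are made up for illustration:

```r
# Passing stringsAsFactors explicitly gives the same result on any R version
old.style <- data.frame(Year = c("2001", "2004", "2020"), stringsAsFactors = TRUE)
new.style <- data.frame(Year = c("2001", "2004", "2020"), stringsAsFactors = FALSE)

class(old.style$Year)  # factor
class(new.style$Year)  # character

as.numeric(old.style$Year)                # level codes, not the years
as.numeric(as.character(old.style$Year))  # the years
as.numeric(new.style$Year)                # the years, no detour needed
```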

Upvotes: 2

LachlanO

Reputation: 1162

I don't think you're missing anything here. The main thing to understand is how R converts between data types, three common ones being numeric, character and factor. Factors are by far (in my opinion) the least intuitive if you come from other languages.

I like to think of factors as 'categories'. They have no order (unlike characters, which can be ordered alphabetically). They're an abstract data type for listing stuff. Others might disagree with that explanation, but it is what helped me understand.

I said Factors have no order, well that was kind of a lie for simplicity. As it turns out Factors also have levels. Levels list the order of things. Say we have a vector

    animals <- factor(c("Rabbit", "Cat", "Dog"))

If we check its levels using levels(animals) it will return "Cat" "Dog" "Rabbit" in that order. This is because we created the vector as characters, so the default 'level order' is alphabetical.

We can change these level orders in ways I won't go into here, but if you wanted Rabbit to be the first level, you would need to set that manually. This means you can create order to these abstract variables.
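As one possible way to set the order by hand, you can pass the levels argument to factor(). The name animals2 is just for illustration:

```r
# Default: levels are sorted alphabetically
animals <- factor(c("Rabbit", "Cat", "Dog"))
levels(animals)

# Supplying levels explicitly makes Rabbit the first level
animals2 <- factor(c("Rabbit", "Cat", "Dog"), levels = c("Rabbit", "Cat", "Dog"))
levels(animals2)

# The underlying integer codes follow the level order
as.integer(animals2)
```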

If we used

    as.numeric(animals)

It would result in c(3, 1, 2). That's because as.numeric converts each element of a factor to a number indicating the position of its level.

If you want to convert a factor, say "1" to the number 1, you would have to first convert it to a character, then a number.

This is because converting a factor directly to a number returns the level positions, as shown above. Converting a factor to character instead returns the level labels as text, and converting that text to numeric then turns the digit strings into actual numbers.

So to get back to your example, I think just using as.numeric is fine as long as the column is not already a factor, UNLESS you actually want the numbers that represent the order of the factor levels.
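A small sketch of what the two calls from the question return. Year here is a made-up stand-in for survey.data$Year, and answer is a hypothetical text column:

```r
# Hypothetical stand-in for survey.data$Year
Year <- c(2019, 2021, 2019, 2020)

as.numeric(Year)             # already numeric: values unchanged
as.numeric(as.factor(Year))  # level codes 1 3 1 2, not the years

# With text categories, going through a factor yields one code per category
answer <- c("yes", "no", "yes")
as.numeric(as.factor(answer))  # 2 1 2, since "no" sorts before "yes"
```

My guess is that the tutorials you saw use the second pattern to turn categories into numeric codes, which is a different goal from keeping the year values.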

Upvotes: 0
