RJW

Reputation: 13

how to convert a factor variable into a numeric - using R

I have another problem and hope for your help. I googled already, asked a friend and tried to understand similar problems/questions around this website, but I still can't figure it out...

Ok so here's my problem: I have a large data set that covers data from 1980-2012. I used the read.spss function to get the data into R

rohdaten <-read.spss("C:\\Users\\xxxxxxx.sav", use.value.labels = TRUE, to.data.frame = TRUE,
        max.value.labels = Inf, trim.factor.names = FALSE,  
        trim_values = TRUE, reencode = NA, use.missings = TRUE)

That seems to work. Next I'd like to analyze variable 14 (v14), a Likert scale running from "totally agree" to "don't agree at all", which is therefore coded as a factor. I want to compare how the replies change over time, so I need the mean, and for that the variable has to be numeric. That's the first step of the issue... According to R for Dummies I need to change the factor into a character first and then into a numeric. Alright, here's my code. First I tried the recode() function, which didn't work, so I just went on and created a new object "econ" that contains a copy of the recoded variable (so I don't affect the original v14 data in the workspace).

rohdaten$v14_2 <- recode(rohdaten$v14, "8 = NA; 9 = NA; 0 = NA; 1 = 1; 2 = 2; 3 = 3;  4 = 4; 5 = 5; as.factor.result = FALSE")  #should recode already - kinda doesn't work
class(rohdaten$v14_2) #just tells me it's a factor...
str(rohdaten$v14_2)
econ <- rohdaten$v14_2

With the "for Dummies-Website" in mind I change the stuff into characters and then into numeric

str(econ)
as.character(econ)
head(econ)
econ <- as.numeric(econ)
head(econ)

This for some reason gives me a "good" result, despite the "error" (??) in the "as character" line... If I go with econ <- as.character(econ) - I get "Warning message: NAs introduced by coercion" after the econ <- as.numeric(econ) command...

Ok so far it seems to work somehow I guess!?

But then I want to calculate the mean for every year (which is in variable 2) and I stumbled upon the function by() which looked like it's doing exactly what I want so my code turned out to be:

avgEconRat <- by(data = rohdaten, INDICES = rohdaten$v2, FUN = mean, na.rm = T)
head(avgEconRat) #actually gives me some means - not sure though whether it's the real means or the means of the "factor-number" that's mentioned in the "for-dummies-website" - sorry I can't explain it better :-(

Now I seem to have the data in the avgEconRat Object, but first of all, I'm not sure if my mean is correct at all, and secondly, and that's somehow the main issue, how do I refer to my data now to plot it?

p1 <- ggplot(na.action=na.exclude, rohdaten, aes(v14, v2))
p1 + geom_point(aes(color = v652), alpha = 0.6) +
      facet_grid(. ~ v5)

That's the code I had in mind - and I know I'd have to replace "rohdaten" with "econ" now, but since I have no idea how "econ" is structured (and also don't really know how to find out), I'm absolutely stuck here :-/ I feel like I have (or might have, depending whether my means are the right ones...) the data I need but kinda lost access to it.

Sorry for my weird problems, but learning programming without real mentoring is kinda tough without any previous experience.

Thank you very much for your patience, time and help!

Upvotes: 1

Views: 17000

Answers (2)

saladin1991

Reputation: 152

I had a similar problem with a dataset from 1988-2012, except that I was trying to turn the labels of a variable into numbers. After quite a few hours of trying different combinations (I am also very new to R), I found the following solution.

At first, I was doing this:

# this requires the "plyr" package

library(plyr)
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
               c("(1) Very Suitable"="3", "(2) Suitable"="2", "(3) Somewhat Suitable"="1", "(4) Not Suitable At All"="-1"))

The values were right, but R was not recognizing the variable as a numeric one. It was therefore impossible to draw a histogram or a regression.

Then I did this:

# Islamic Leviathan

my.data2$islamic_leviathan <- c("3", "2", "1", "-1")

my.data2$islamic_leviathan_score <- as.factor(my.data2$islamic_leviathan)
my.data2$islamic_leviathan_score

my.data2$islamic_leviathan_score_1 <-as.numeric(as.character(my.data2$islamic_leviathan_score))

my.data2$islamic_leviathan_score_1

This did change the variable from a factor to a numeric one, but the values of the variable were all changed in the process, so my results were completely wrong.
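A likely reason for the changed values, sketched on made-up data: assigning a short vector to a data frame column overwrites the column and recycles the short vector, so the original answers are lost. The data frame `toy` and its column names are hypothetical, and this is only an assumption about what happened in the attempt above.

```r
# Toy data frame with 8 rows; names and values are made up for illustration
toy <- data.frame(
  answer = c("agree", "agree", "disagree", "agree",
             "disagree", "disagree", "agree", "agree"),
  stringsAsFactors = TRUE
)

# Assigning 4 values to an 8-row column recycles them: 3, 2, 1, -1, 3, 2, 1, -1
toy$overwritten <- c("3", "2", "1", "-1")

# The new column no longer has any relation to toy$answer
toy$overwritten
```

If the number of rows is not a multiple of the vector length, the same assignment fails with an error instead of recycling.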

What I just did, and what seems to have solved the problem, is this:

library(plyr)
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
               c("(1) Very Suitable"="3", "(2) Suitable"="2", "(3) Somewhat Suitable"="1", "(4) Not Suitable At All"="-1"))

my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))

I used a mix of both attempts: revaluing the factor levels and then converting the result to numeric. The results I get are now consistent with the original values in the dataset, where the variables are stored as factors. You can use this approach to rename the levels of a variable to whatever you like and then convert them to numeric.
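For comparison, the same label-to-score mapping can be sketched in base R without plyr, using a named numeric vector as a lookup table. The example data and the names `labels`, `score_map` and `score` are made up for illustration; only the labels and scores come from the code above.

```r
# Hypothetical example data using the labels from the answer above
labels <- c("(1) Very Suitable", "(2) Suitable",
            "(3) Somewhat Suitable", "(4) Not Suitable At All")
f <- factor(c(labels[1], labels[4], labels[2], labels[2], labels[3]),
            levels = labels)

# Named numeric vector: names are the labels, values are the scores
score_map <- c(3, 2, 1, -1)
names(score_map) <- labels

# Look up each observation by its label; unname() drops the label names
score <- unname(score_map[as.character(f)])
score
#> [1]  3 -1  2  2  1
```

Because the lookup goes through as.character(f), it does not depend on the internal order of the factor levels.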

Upvotes: 0

Jthorpe

Reputation: 10167

First, here's why you would have to convert to character before converting to numeric:

Let's say we have a factor that contains a handful of numbers:

x = factor(c(1,2,7,7))

You can inspect how this is represented in R like so:

unclass(x)
#> [1] 1 2 3 3
#> attr(,"levels")
#> [1] "1" "2" "7"

You can see that there are 3 levels, and that the values are stored as indexes into those 3 levels. Furthermore, if you call as.numeric() directly, you get the index vector and not the values you were hoping for:

as.numeric(x)
#> [1] 1 2 3 3
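A minimal sketch of the two usual base R ways to get the original numbers back from such a factor. `x` is the same toy factor as above; the names `via_character` and `via_levels` are only for illustration.

```r
x <- factor(c(1, 2, 7, 7))

# as.numeric() on the factor itself returns the internal level indexes
as.numeric(x)
#> [1] 1 2 3 3

# Convert to character first, then to numeric, to get the labels as numbers
via_character <- as.numeric(as.character(x))

# Same result: convert the levels once, then index them by the factor codes
via_levels <- as.numeric(levels(x))[x]

via_character
#> [1] 1 2 7 7
```

Both only make sense when the level labels are themselves numbers; for text labels such as "agree" they would produce NA with a coercion warning.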

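On the other hand, if you have a Likert scale and the factor levels are in the correct order, you may actually want the index. Check the order first, because factor() sorts character levels alphabetically by default:

f = factor(c("agree","agree","somewhat agree","somewhat agree","somewhat disagree","disagree","disagree"))

levels(f)
#> [1] "agree" "disagree" "somewhat agree" "somewhat disagree"

as.numeric(f)
#> [1] 1 1 3 3 4 2 2

Here "disagree" is level 2, so the index does not follow the agree-to-disagree order.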

If, however, your levels are out of order, as in:

f = factor(sample(c("agree","somewhat agree","somewhat disagree","disagree"),
                  20,
                  TRUE))
levels(f)
#> [1] "agree" "disagree" "somewhat agree" "somewhat disagree"

then instead of calling as.numeric(as.character(f)) (which makes no sense in this case), you'll want to re-order the factor levels, and then call as.numeric, like so:

as.numeric(factor(f,
                  # specify the levels in the correct order:
                  levels=c("agree","somewhat agree","somewhat disagree","disagree")))
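To tie this back to the question's goal of a mean per year, here is a minimal sketch on made-up data. The data frame `toy`, its columns `year` and `answer`, and the values are all hypothetical stand-ins for rohdaten$v2 and rohdaten$v14. It reorders the levels, converts to the index, and uses base R's aggregate() to get one row per year.

```r
# Made-up data standing in for the year and Likert columns of the question
toy <- data.frame(
  year   = c(1980, 1980, 1980, 1981, 1981),
  answer = c("agree", "disagree", "somewhat agree", "disagree", NA)
)

likert_levels <- c("agree", "somewhat agree", "somewhat disagree", "disagree")

# 1 = agree ... 4 = disagree; a missing answer stays NA
toy$score <- as.numeric(factor(toy$answer, levels = likert_levels))

# One mean per year; the formula interface drops rows where score is NA
yearly <- aggregate(score ~ year, data = toy, FUN = mean)
yearly
```

The result is an ordinary data frame with the columns year and score, so str(yearly) shows its structure and it can be passed to a plotting function as the data argument.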

Upvotes: 3
