Reputation: 48436
I have factors in R that are salary ranges of the form $100,001 - $150,000, over $150,000, $25,000, etc. and would like to convert these to numeric values (e.g. converting the factor $100,001 - $150,000 to the integer 125000).
Similarly, I have educational categories such as High School Diploma, Current Undergraduate, PhD, etc. that I would like to assign numbers to (e.g., giving PhD a higher value than High School Diploma).
How do I do this, given the data frame containing these values?
Upvotes: 3
Views: 36401
Reputation: 213
You could use the recode function in the car package.
For example:
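library(car)
# Each 'old'=new pair maps one salary label to a number.
# Note: if df$salary is a factor, recode() returns a factor by default; see ?recode.
df$salary <- recode(df$salary,
"'$100,001 - $150,000'=125000;'over $150,000'=150000")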
For more information on how to use this function, see its help file (?recode).
Upvotes: 8
Reputation: 25844
To convert the salary strings to numbers:
# data
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000"), educ = c("High School Diploma", "Current Undergraduate",
"PhD"),stringsAsFactors=FALSE)
# Remove comma and dollar sign
temp <- gsub("[,$]","", df$sal)
# remove text
temp <- gsub("[[:alpha:]]","", temp)
# get average over range
df$ave.sal <- sapply(strsplit(temp , "-") , function(i) mean(as.numeric(i)))
For your education levels, if you want them as numbers:
df$educ.f <- as.numeric(factor(df$educ , levels=c("High School Diploma" ,
"Current Undergraduate", "PhD")))
df
#                   sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000   High School Diploma 125000.5      1
# 2       over $150,000 Current Undergraduate 150000.0      2
# 3             $25,000                   PhD  25000.0      3
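If only the ranking of the education levels matters, an ordered factor is another option, since it keeps the labels and still allows comparisons. The following is a minimal sketch, assuming the same three levels in the same order as above; the names educ_levels, educ and educ_ord are introduced here purely for illustration.

```r
# Assumed ordering of the education levels, lowest to highest
educ_levels <- c("High School Diploma", "Current Undergraduate", "PhD")

# Illustrative input, including a missing value
educ <- c("PhD", "High School Diploma", "Current Undergraduate", NA)

# An ordered factor keeps the labels and supports <, >, etc.
educ_ord <- factor(educ, levels = educ_levels, ordered = TRUE)

# Which entries rank above a high school diploma?
above_hs <- educ_ord > "High School Diploma"

# Integer codes follow the order of educ_levels; NA stays NA
educ_codes <- as.integer(educ_ord)
```

The integer codes only encode rank, so the gaps between levels carry no meaning; whether that is acceptable depends on the analysis.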
EDIT
Missing (NA) values should not cause any problems:
# Data that includes missing values
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000" , NA), educ = c(NA, "High School Diploma",
"Current Undergraduate", "PhD"),stringsAsFactors=FALSE)
Rerunning the commands above then gives:
df
#                   sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000                  <NA> 125000.5     NA
# 2       over $150,000   High School Diploma 150000.0      1
# 3             $25,000 Current Undergraduate  25000.0      2
# 4                <NA>                   PhD       NA      3
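To avoid rerunning the individual commands for each new data set, the salary steps could be wrapped in a function. This is only a sketch of the same gsub/strsplit approach; the function name salary_midpoint is hypothetical, and it assumes every label is either a single amount or two amounts separated by a hyphen.

```r
# Hypothetical helper: turn labels such as "$100,001 - $150,000" or
# "over $150,000" into one number (the midpoint for ranges).
salary_midpoint <- function(x) {
  x <- as.character(x)                     # also accepts factors
  x <- gsub("[,$]", "", x)                 # drop commas and dollar signs
  x <- gsub("[[:alpha:]]", "", x)          # drop words such as "over"
  parts <- strsplit(x, "-", fixed = TRUE)  # split ranges into two ends
  # average the ends of each range; single amounts pass through unchanged
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

sal <- factor(c("$100,001 - $150,000", "over $150,000", "$25,000", NA))
res <- salary_midpoint(sal)
```

Missing labels come back as NA, matching the behaviour shown in the table above.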
Upvotes: 10
Reputation: 101
I'd just make a vector of values that correspond to the levels of your factor and map them in. The code below is a much less elegant solution than I'd have liked, because I couldn't figure out how to do the indexing with a vector instead of a loop, but it will do the job if your data is not overwhelmingly large. Say we want to map the factor elements of fact to the numbers in vals:
fact <- as.factor(c("a", "b", "c"))
vals <- c(1, 2, 3)
# for example:
vals[levels(fact) == "b"]
# gives: [1] 2
# now make an example data frame:
data <- data.frame(myvar = fact[sample(1:3, 10, replace = TRUE)])
# our vlookup function:
vlookup <- function(fact, vals, x) {
  # the levels of fact and the entries of vals must line up one-to-one
  stopifnot(length(levels(fact)) == length(vals))
  out <- rep(vals[1], length(x))
  for (i in seq_along(x)) {
    out[i] <- vals[levels(fact) == x[i]]
  }
  return(out)
}
# test it:
data$myvarNumeric <- vlookup(fact, vals, data$myvar)
This should work for what you're describing.
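The indexing with a vector mentioned at the start can be done with base R's match(), which returns the position of each element in the levels, so no loop is needed. A minimal sketch, reusing fact and vals from above; the vector x and the result name looked_up are made up for illustration.

```r
fact <- as.factor(c("a", "b", "c"))
vals <- c(1, 2, 3)

# Illustrative values to look up, including a missing one
x <- factor(c("b", "a", "c", "b", NA), levels = levels(fact))

# match() gives, for each element of x, its position in levels(fact);
# indexing vals by those positions performs the whole lookup at once
looked_up <- vals[match(x, levels(fact))]
looked_up
# [1]  2  1  3  2 NA
```

Unlike the loop, this returns NA for values that are missing or not among the levels instead of failing.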
Upvotes: 0