Reputation: 1372
I am trying to convert factor variables to numeric. I have tried both of these solutions:
as.numeric(levels(f))[f]
as.numeric(as.character(f))
But the issue persists. Both give the warning message: NAs introduced by coercion
Reproducible example:
df = data.frame(x = c("10: Already Delinquent 90+",
"11: Credit History <6 Months",
"12: Current Balance = 0",
"13: Balance (2-6)=0",
"20: 1+ x 90+",
"30: 3+ x 60-89",
"31: 2 x 60-89",
"32: 1 x 60-89",
"40: 3+ x 30-59",
"41: 2 x 30-59",
"42: 1 x 30-59",
"50: Insufficient Performance",
"60: 3+ x 1-29",
"61: 2 x 1-29",
"62: 1 x 1-29",
"70: Never delinquent"),
y = c("00:Bad",
"01:Ind",
"02:Good",
"NA",
"00:Bad",
"01:Ind",
"02:Good",
"NA",
"00:Bad",
"01:Ind",
"02:Good",
"NA",
"00:Bad",
"01:Ind",
"02:Good",
"NA"),
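z = ceiling(rnorm(16)),
stringsAsFactors = TRUE)  # needed in R >= 4.0, where strings are no longer converted to factors by default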
# Select all the factor variables
factorvars = colnames(df)[which(sapply(df, is.factor))]
# Concatenate with "_Num"
xxx <- paste(factorvars, "_Num", sep = "")
# Converting factor to numeric (this is the attempt that gives the NA warning)
for (i in seq_along(factorvars)) {
  df[, xxx[i]] = NA
  df[, xxx[i]] = as.numeric(levels(df[, factorvars[i]]))[df[, factorvars[i]]]
}
I want to keep the factor variables and create new variables that hold each level converted to a number. The desired output looks like this:
x                             y        x_Num  y_Num
10: Already Delinquent 90+    00:Bad   1      1
11: Credit History <6 Months  01:Ind   2      2
12: Current Balance = 0       02:Good  3      3
13: Balance (2-6)=0           NA       4      NA
20: 1+ x 90+                  00:Bad   5      1
30: 3+ x 60-89                01:Ind   6      2
31: 2 x 60-89                 02:Good  7      3
32: 1 x 60-89                 NA       8      NA
40: 3+ x 30-59                00:Bad   9      1
41: 2 x 30-59                 01:Ind   10     2
42: 1 x 30-59                 02:Good  11     3
50: Insufficient Performance  NA       12     NA
60: 3+ x 1-29                 00:Bad   13     1
61: 2 x 1-29                  01:Ind   14     2
62: 1 x 1-29                  02:Good  15     3
70: Never delinquent          NA       16     NA
Upvotes: 0
Views: 260
Reputation: 28441
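Judging by your desired output, you don't want to convert the factors to the numbers contained in their labels. Instead, you want the internal integer codes of the factors (1 for the first level, 2 for the second, and so on). Both of your attempts try to parse labels such as "10: Already Delinquent 90+" as numbers, which is why they return NA with a coercion warning.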
Try this:
df[xxx] <- lapply(df[factorvars], as.numeric)
#                               x       y  z x_Num y_Num
#  1   10: Already Delinquent 90+  00:Bad  2     1     1
#  2 11: Credit History <6 Months  01:Ind  2     2     2
#  3      12: Current Balance = 0 02:Good  1     3     3
#  4          13: Balance (2-6)=0    <NA>  1     4    NA
#  5                 20: 1+ x 90+  00:Bad  0     5     1
#  6               30: 3+ x 60-89  01:Ind  0     6     2
#  7                31: 2 x 60-89 02:Good  0     7     3
#  8                32: 1 x 60-89    <NA>  0     8    NA
#  9               40: 3+ x 30-59  00:Bad  2     9     1
# 10                41: 2 x 30-59  01:Ind  0    10     2
# 11                42: 1 x 30-59 02:Good  0    11     3
# 12 50: Insufficient Performance    <NA>  1    12    NA
# 13                60: 3+ x 1-29  00:Bad  1    13     1
# 14                 61: 2 x 1-29  01:Ind -1    14     2
# 15                 62: 1 x 1-29 02:Good -1    15     3
# 16         70: Never delinquent    <NA> -1    16    NA
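To make the difference between the two kinds of conversion concrete, here is a minimal sketch on a small made-up factor `f` (hypothetical, not part of the question's data). It contrasts the integer codes with an attempt to parse the labels, and shows one possible way to pull out the leading number instead, assuming every label has the form "NN: text".

```r
# Hypothetical factor with labels in the same "NN: text" style as the question
f <- factor(c("10: A", "20: B", "10: A", "30: C"))

# Internal integer codes: the position of each value among the sorted levels
codes <- as.numeric(f)
print(codes)    # 1 2 1 3

# Parsing the labels as numbers fails, because "10: A" is not a number
parsed <- suppressWarnings(as.numeric(as.character(f)))
print(parsed)   # NA NA NA NA

# If the leading number is what you need, strip everything from the colon on
# (assumes every label starts with digits followed by ":")
leading <- as.numeric(sub(":.*$", "", as.character(f)))
print(leading)  # 10 20 10 30
```

The codes depend only on the order of the levels, so they change if the levels are reordered or if unused levels are dropped.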
Data
Before running the conversion, I cleaned your example data by changing the character string "NA" to actual NA values and dropping the now-unused "NA" level:
is.na(df$y) <- df$y == "NA"
df$y <- droplevels(df$y)
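The effect of these two cleaning steps can be checked on a small standalone factor. This is only a sketch, and `y` here is a hypothetical vector rather than the column from the question.

```r
# Hypothetical factor where missing values were entered as the string "NA"
y <- factor(c("00:Bad", "01:Ind", "NA", "02:Good", "NA"))
print(levels(y))      # "NA" is still an ordinary level at this point

# Mark those entries as real missing values, then drop the unused level
is.na(y) <- y == "NA"
y <- droplevels(y)

print(levels(y))      # "00:Bad" "01:Ind" "02:Good"
print(as.numeric(y))  # 1 2 NA 3 NA
```

Without the cleaning step, the string "NA" would count as a fourth level and be converted to the code 4 instead of a missing value.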
Upvotes: 2