Reputation: 1
I'm new to R and Stack Overflow, so my question probably contains a lot of mistakes; sorry in advance.
I'm using R's cor() function, and it took me an hour to work around a small problem, but I still don't understand what's wrong. Basically I have a data.frame, and I want to flag numeric variables that are highly correlated. So I create a subset of the numeric variables, except for SalePrice, which has NAs in the test set:
numericCols <- which(sapply(full[,!(names(full) %in% 'SalePrice')], is.numeric))
Then
cor(full[,numericCols])
gives an error:
Error in cor(full[, numericCols]) : 'x' must be numeric.
Except when I do it this way:
numericCols2 <- which(sapply(full, is.numeric))
numericCols2 <- numericCols2[-31] #dropping SalePrice manually
it works just fine.
When I do numericCols == numericCols2 the output is:
LotFrontage     LotArea
       TRUE        TRUE
# .
# . All true
# .
   HouseAge       isNew   Remodeled   BsmtFinSF     PorchSF
      FALSE       FALSE       FALSE       FALSE       FALSE
All the ones that are FALSE are variables I've created myself, for example HouseAge:
full$HouseAge <- full$YrSold - full$YearBuilt
Why is this happening?
Upvotes: 0
Views: 67
Reputation: 3914
The indices in numericCols are column positions within the subset that excludes SalePrice, not within full, so they no longer line up once you use them to index full. Here is an example that reproduces your problem and explains why you get an error one way and not the other.
Let's simulate some data (I use the built-in iris data set from the datasets package and add a non-numeric column "SalePrice"):
data(iris)
full <- cbind(data.frame(SalePrice=rep("NA", nrow(iris))),iris)
If we examine the data frame full, we see that the "SalePrice" column is not numeric (a factor here; under the R >= 4.0 defaults it would show as chr instead):
str(full)
# 'data.frame': 150 obs. of 6 variables:
# $ SalePrice : Factor w/ 1 level "NA": 1 1 1 1 1 1 1 1 1 1 ...
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Now let's examine what happens when you run the following code:
numericCols <- which(sapply(full[,!(names(full) %in% 'SalePrice')], is.numeric))
numericCols
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 2 3 4
It returns a numeric vector of column positions within the subset full[,!(names(full) %in% 'SalePrice')], not within full itself.
As you can see, "SalePrice" is the first column of my data frame, so if I exclude it and then look for the numeric columns in the resulting data.frame, I get columns 1, 2, 3 and 4 instead of 2, 3, 4 and 5.
When I then use those positions to index full and call cor(), I get an error:
cor(full[, numericCols])
#Error in cor(full[, numericCols]) : 'x' must be numeric
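To make the shift visible, here is a small sketch on the same simulated data that compares the column names each indexing picks up. The helper name noPrice is introduced here only for illustration:

```r
data(iris)
full <- cbind(data.frame(SalePrice = rep("NA", nrow(iris))), iris)

# Subset without SalePrice (noPrice is a hypothetical helper name)
noPrice <- full[, !(names(full) %in% "SalePrice")]
numericCols <- which(sapply(noPrice, is.numeric))

# Columns you meant to select (positions are valid for the subset)
names(noPrice)[numericCols]
# "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"

# Columns actually selected when the same positions index full
names(full)[numericCols]
# "SalePrice"    "Sepal.Length" "Sepal.Width"  "Petal.Length"
```

The second result includes the non-numeric SalePrice column, which is what cor() complains about.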
Your other approach works because it computes the positions on full itself, so they are the correct column indices:
numericCols2 <- which(sapply(full, is.numeric))
numericCols2
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 2 3 4 5
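If you would rather not depend on positions at all, one option is to select by column name. This is only a sketch on the simulated data above, assuming name-based selection suits your workflow; numericNames and corMat are names made up for the example:

```r
data(iris)
full <- cbind(data.frame(SalePrice = rep("NA", nrow(iris))), iris)

# Names of all numeric columns in full, with SalePrice excluded by name.
# setdiff() is harmless if SalePrice is not numeric (as in this simulation).
numericNames <- names(full)[sapply(full, is.numeric)]
numericNames <- setdiff(numericNames, "SalePrice")

# Indexing by name cannot shift when columns are dropped or reordered
corMat <- cor(full[, numericNames])
dim(corMat)
```

Because names stay attached to their columns, this works the same way wherever SalePrice sits in the data frame.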
Upvotes: 1