NotWarrenBuffett

Reputation: 1

Why aren't these two objects the same?

I'm new to R and Stack Overflow, so my question probably contains mistakes; sorry in advance.

I'm using R's cor() function, and it took me an hour to work around a small problem that I still don't understand. I have a data.frame and want to flag numeric variables that are highly correlated. So I select the numeric variables, except for SalePrice, which has NAs in the test set:

numericCols <- which(sapply(full[,!(names(full) %in% 'SalePrice')], is.numeric))   

Then

cor(full[,numericCols])    

gives an error:

Error in cor(full[, numericCols]) : 'x' must be numeric.

Except when I do it this way:

numericCols2 <- which(sapply(full, is.numeric))    
numericCols2 <- numericCols2[-31] #dropping SalePrice manually    

it works just fine.

When I do numericCols == numericCols2 the output is:

LotFrontage     
TRUE    
LotArea    
TRUE    
# .    
# .   All true    
# .    
HouseAge    
FALSE    
isNew    
FALSE    
Remodeled    
FALSE    
BsmtFinSF    
FALSE    
PorchSF    
FALSE    

All the ones that are false are variables I've created myself, for example HouseAge:

full$HouseAge <- full$YrSold - full$YearBuilt    

Why is this happening?

Upvotes: 0

Views: 67

Answers (1)

Katia

Reputation: 3914

SalePrice in your data.frame is probably a character column or some other non-numeric column. Here is an example that reproduces your problem and explains why one approach gives an error and the other does not.

Let's simulate some data (I use the iris data set, which ships with base R in the datasets package, and add a non-numeric column "SalePrice"):

data(iris)
full <- cbind(data.frame(SalePrice=rep("NA", nrow(iris))),iris)

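If we examine the data.frame full, we see that the "SalePrice" column is not numeric (a factor in the output below; in R 4.0.0 and later, where stringsAsFactors defaults to FALSE, it would be character):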

str(full)
# 'data.frame': 150 obs. of  6 variables:
#  $ SalePrice   : Factor w/ 1 level "NA": 1 1 1 1 1 1 1 1 1 1 ...
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Now let's examine what your first expression returns:

numericCols <- which(sapply(full[,!(names(full) %in% 'SalePrice')], is.numeric))
numericCols
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#            1            2            3            4 

It returns a named integer vector of column positions within the subset full[,!(names(full) %in% 'SalePrice')], not within full itself. In my data.frame "SalePrice" is the first column, so once it is excluded, the numeric columns are found at positions 1, 2, 3 and 4 instead of 2, 3, 4 and 5. The same shift explains your comparison: columns located before SalePrice keep their position, while every column after it (such as the variables you created later) is off by one.

When I then call cor() with these indices on full, the selection includes the non-numeric "SalePrice" column, so I get an error:

cor(full[, numericCols])
#Error in cor(full[, numericCols]) : 'x' must be numeric
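
To see the shift directly, a small check (a sketch on the same simulated iris data, not your real data set; the variable name shifted is only illustrative) compares the names the indices were computed for with the columns they actually select in full:

```r
# Rebuild the example data (SalePrice is deliberately non-numeric).
data(iris)
full <- cbind(data.frame(SalePrice = rep("NA", nrow(iris))), iris)

# Positions computed on the subset WITHOUT SalePrice ...
numericCols <- which(sapply(full[, !(names(full) %in% "SalePrice")], is.numeric))

# ... but applied to the full data.frame, where every position is off by one.
shifted <- names(full)[numericCols]

print(names(numericCols))  # columns the positions were computed for
print(shifted)             # columns those positions actually pick in `full`
```

The second vector starts with "SalePrice" and never reaches "Petal.Width", which is why cor() complains that 'x' is not numeric.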

Your other approach works because it computes the positions on the full data.frame, so the indices are correct:

numericCols2 <- which(sapply(full, is.numeric))  
numericCols2
#Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#           2            3            4            5  
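
One way to avoid position bookkeeping altogether is to select columns by name rather than by index. The following is a minimal sketch on the simulated data; isNum, numericNames and corMat are illustrative names, and the use = "pairwise.complete.obs" argument is an assumption about how you might want to treat NAs in your real data:

```r
# Rebuild the example data (SalePrice is deliberately non-numeric).
data(iris)
full <- cbind(data.frame(SalePrice = rep("NA", nrow(iris))), iris)

# Find numeric columns on the full data.frame, then drop SalePrice by name.
# Names do not shift when a column is excluded, unlike positions.
isNum <- sapply(full, is.numeric)
numericNames <- setdiff(names(full)[isNum], "SalePrice")

# Correlation matrix of the selected columns.
corMat <- cor(full[, numericNames], use = "pairwise.complete.obs")
print(numericNames)
print(dim(corMat))
```

In this simulated data SalePrice is not numeric, so setdiff() changes nothing; in a data set where SalePrice is numeric, it removes that column without the need for a hard-coded index such as -31.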

Upvotes: 1
