Rubens Rodrigues

Reputation: 165

Principal Components Analysis: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

I'm trying to run a Principal Components Analysis, but I get the error: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

I know all the columns have to be numeric, but how should I handle character columns in the data set? E.g.:

data(birth.death.rates.1966)
data2 <- birth.death.rates.1966
princ <- prcomp(data2)

Should I add a new column that maps each country name to a numeric code? If so, how do I do this in R?

Upvotes: 9

Views: 85264

Answers (3)

Aditya Jadhav

Reputation: 1

In R, converting a character column to a factor does not make it numeric. The encoding lets a machine learning model handle the column mathematically, but the codes are labels, not numeric data.

Example: if a list of names is encoded numerically, one name ends up with a higher code than another, and depending on the model that difference may be treated as meaningful. It should not be, because names (text that only labels a specific set of rows) generally should not define how a model works.

Also if you try working with this data assuming it to be numeric, you may get the following error:

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

The explanation above is why this error occurs.

To overcome this problem, scale only the numeric columns:

training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])

In this data set, columns 1 and 4 hold encoded data and cannot be treated as numeric. Columns 2 and 3 originally contained numeric data, so the model can be run on that part of the data only. The index [, 2:3] in the code above selects all rows and columns 2 and 3.
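As a rough illustration of that column selection, here is a minimal sketch on a made-up data frame. The column names and values are hypothetical and only mimic the layout described above (encoded columns 1 and 4, numeric columns 2 and 3).

```r
# Hypothetical data: columns 1 and 4 are encoded labels (factors),
# columns 2 and 3 are numeric measurements
training_set <- data.frame(
  Country   = factor(c("France", "Spain", "Germany", "France")),
  Age       = c(44, 27, 30, 38),
  Salary    = c(72000, 48000, 54000, 61000),
  Purchased = factor(c("No", "Yes", "No", "Yes"))
)

# scale(training_set) would fail here because of the factor columns,
# so standardise only the numeric columns and leave the rest untouched
training_set[, 2:3] <- scale(training_set[, 2:3])

# Each scaled column now has mean 0 and standard deviation 1
col_means <- colMeans(training_set[, 2:3])
col_sds   <- apply(training_set[, 2:3], 2, sd)
```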

Upvotes: 0

Spacedman

Reputation: 94182

You can convert a character vector to numeric values by going via factor. Each unique value then gets a unique integer code. In this example there are four values, so the codes are 1 to 4, assigned in alphabetical order (the default ordering of factor levels):

> d = data.frame(country=c("foo","bar","baz","qux"),x=runif(4),y=runif(4))
> d
  country          x         y
1     foo 0.84435112 0.7022875
2     bar 0.01343424 0.5019794
3     baz 0.09815888 0.5832612
4     qux 0.18397525 0.8049514
> d$country = as.numeric(as.factor(d$country))
> d
  country          x         y
1       3 0.84435112 0.7022875
2       1 0.01343424 0.5019794
3       2 0.09815888 0.5832612
4       4 0.18397525 0.8049514

You can then run prcomp:

> prcomp(d)
Standard deviations:
[1] 1.308665216 0.339983614 0.009141194

Rotation:
               PC1          PC2          PC3
country -0.9858920  0.132948161 -0.101694168
x       -0.1331795 -0.991081523 -0.004541179
y       -0.1013910  0.009066471  0.994805345

Whether this makes sense for your application is up to you. Maybe you just want to drop the first column: prcomp(d[,-1]) and work with the numeric data, which seems to be what the other "answers" are trying to achieve.
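If the position of the non-numeric columns is not known in advance, one possible approach is to pick out the numeric columns by type rather than by index. The sketch below reuses the toy data frame from this answer; the names num_cols and pca, and the choice of scale. = TRUE, are assumptions made for illustration.

```r
set.seed(1)  # make the toy data reproducible
d <- data.frame(country = c("foo", "bar", "baz", "qux"),
                x = runif(4), y = runif(4))

# Logical vector marking which columns are numeric
num_cols <- sapply(d, is.numeric)

# Run the PCA on the numeric columns only, standardising each one first
pca <- prcomp(d[, num_cols], scale. = TRUE)

summary(pca)
```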

Upvotes: 11

parth

Reputation: 1631

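The first column of the data frame is a character column, so you can move it into the row names:

library(tidyverse)
# assign the result back, otherwise data2 is left unchanged
data2 <- data2 %>% remove_rownames() %>% column_to_rownames(var = "country")
princ <- prcomp(data2)

Alternatively, in base R:

# set the row names before dropping the column they come from
rownames(data2) <- data2[, 1]
data2 <- data2[, -1]
princ <- prcomp(data2)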
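One benefit of keeping the labels as row names is that prcomp carries them through to the scores. The sketch below uses a small stand-in data frame with made-up values rather than the real data set, so the column names and numbers are assumptions.

```r
# Stand-in for data2: one label column followed by numeric columns
# (values are invented for illustration)
data2 <- data.frame(country = c("A", "B", "C", "D"),
                    birth   = c(36.4, 37.3, 42.1, 55.8),
                    death   = c(14.6, 8.0, 15.3, 25.6))

rownames(data2) <- data2$country  # keep the labels as row names
data2$country <- NULL             # drop the non-numeric column

princ <- prcomp(data2)

# The row names reappear on the matrix of principal component scores
rownames(princ$x)
```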

Upvotes: 2
