How to fix "Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric" for PCA - prcomp?

I encountered this error while trying to run a PCA with the prcomp function on one of my datasets.

The code I use is:

data(iris)
myPr <- prcomp(iris[, -5], scale = TRUE)
PCA <- cbind(iris, myPr$x)

This is followed by a ggplot2 part for the graph. In this example, iris is a data.frame with four numeric columns and a fifth column, Species, which is a factor:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

I drop the 5th column for prcomp (as expected) and it works just fine. But then I tried another dataset. I do the same steps, except for the scaling (which is not needed), and I have to remove more columns (columns 1-6, which are character, i.e. categorical variables). The code in question is as follows:

library(readxl)  # provides read_xlsx()
DATASET_PCA_MERGED <- read_xlsx("C:/Users/i5/Desktop/TCGA MERGED Desktop.xlsx")
# Drop the six categorical columns before the PCA
PCA_Input <- DATASET_PCA_MERGED[, -c(1, 2, 3, 4, 5, 6)]
myPr <- prcomp(PCA_Input)

Note that checking the class of PCA_Input shows it is a data.frame, the same as iris in the test example. In this case prcomp fails with:

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

> class(PCA_Input)
[1] "tbl_df"     "tbl"        "data.frame"

All columns are read in as character even though they are numeric in Excel (I also tested importing from CSV, same issue).
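One way to see why a column ends up as character is to list the entries that do not parse as numbers. The following is only a minimal sketch on a made-up data frame: `toy` and `bad_entries` are hypothetical names, and the stray "n/a" entry is an assumption about what such a sheet might contain, not something taken from the actual file.

```r
# Hypothetical stand-in for an imported sheet: numbers stored as text,
# with one stray non-numeric entry in column b.
toy <- data.frame(
  a = c("1.5", "2.0", "3.25"),
  b = c("10", "n/a", "30"),
  stringsAsFactors = FALSE
)

# Which columns are numeric right now? (none, in this toy case)
sapply(toy, is.numeric)

# For each column, collect the entries that as.numeric() cannot parse.
bad_entries <- lapply(toy, function(col) {
  parsed <- suppressWarnings(as.numeric(col))
  unique(col[is.na(parsed) & !is.na(col)])
})
bad_entries
```

If a column shows entries here, those are the values that would turn into NA when the column is converted.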

So I tried to convert to numeric in several ways, following different posts here, but all of them required the object to be unlisted, which can't be done if I'm going to run a PCA afterwards, as I need to keep the structure of the data frame. Can someone help me convert the columns to numeric while keeping the data frame structure?
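For illustration, a common base R pattern for this kind of conversion assigns the converted columns back into `df[]`, which keeps the object a data frame. This is a sketch on a hypothetical data frame (`toy`, with made-up columns `g1` and `g2`), not on the real PCA_Input:

```r
# Hypothetical character-only data frame standing in for PCA_Input.
toy <- data.frame(
  g1 = c("1.5", "2.0", "3.25", "4.0"),
  g2 = c("10", "20", "30", "45"),
  stringsAsFactors = FALSE
)

# Assigning into toy[] replaces the columns one by one but keeps the
# data frame (and its column names) intact.
toy[] <- lapply(toy, as.numeric)

str(toy)

# With all columns numeric, prcomp() accepts the data frame.
pr <- prcomp(toy)
summary(pr)
```

Entries that cannot be parsed become NA (with a warning), so it is worth checking for missing values before running prcomp.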

Taking another approach, I used the raw data in .txt:

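LGG_EXP <- read.delim("C:/Users/i5/Dropbox/Guilherme Vergara/Doutorado/Data/Datasets/TCGA LGG/EXP_DATA.txt")
# Note: the result of na.omit() is not assigned, so LGG_EXP is unchanged
na.omit(LGG_EXP)
# Drop the first two (non-numeric) columns before the PCA
PCA <- LGG_EXP[, -c(1, 2)]
myPr <- prcomp(PCA)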

And then I get a new error: Error in svd(x, nu=0, nv=k) : Infinite or missing values in 'x'

Looking through other posts here, I tried:

> all(is.finite(unlist(PCA)))
[1] FALSE

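So I have some non-finite values in this dataset (is.finite is FALSE for NA and NaN as well as for Inf, so they may be infinite or missing). I'm not sure how to proceed here: either locate them for removal, or take another approach.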
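To locate such values, one option is to convert the data frame to a matrix and ask `which()` for row and column positions. A minimal sketch, assuming an all-numeric data frame; `toy`, `m` and `bad` are hypothetical names and the data are made up:

```r
# Hypothetical numeric data frame with one NA and one Inf.
toy <- data.frame(
  g1 = c(1, NA, 3, 4),
  g2 = c(5, 6, Inf, 8)
)

# is.finite() is FALSE for NA, NaN, Inf and -Inf alike.
m <- as.matrix(toy)

# Row/column positions of every non-finite cell.
bad <- which(!is.finite(m), arr.ind = TRUE)
bad

# Number of non-finite cells per column.
colSums(!is.finite(m))
```

Here `bad` has one row per offending cell, with the columns `row` and `col` giving its position.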

> sapply(PCA, "is.infinite"(PCA))
Error in is.infinite(PCA) : 
  default method not implemented for type 'list'

> sapply(PCA, "is.infinite"(unlist(PCA)))
Error in match.fun(FUN) : 
  'is.infinite(unlist(PCA))' is not a function, character or symbol

I didn't go any further than this, as I'm not sure what the problem is and I'm clearly not using the sapply function correctly. In addition, I'd like to solve this without needing access to the .txt files (as this will be a recurring problem in my line of work). Can someone please help me with this?
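For reference, both failing calls above pass sapply the result of a call rather than a function; sapply expects the function itself and then calls it once per column. A minimal sketch on made-up data (`toy`, `n_inf`, `n_na`, `keep` and `clean` are hypothetical names), which also drops the affected rows so that prcomp runs:

```r
# Hypothetical numeric data frame with one NA and one Inf.
toy <- data.frame(
  g1 = c(1, NA, 3, 4, 5),
  g2 = c(5, 6, Inf, 8, 7),
  g3 = c(2, 4, 6, 9, 1)
)

# sapply() receives a function and applies it to each column in turn.
n_inf <- sapply(toy, function(col) sum(is.infinite(col)))
n_na  <- sapply(toy, function(col) sum(is.na(col)))
n_inf
n_na

# Keep only the rows in which every value is finite, then run the PCA.
keep  <- rowSums(!is.finite(as.matrix(toy))) == 0
clean <- toy[keep, , drop = FALSE]
pr <- prcomp(clean)
```

Dropping rows is only one possible choice; whether it is appropriate depends on how many rows are affected and why the values are missing.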

Thanks in advance
