Skårup

Reputation: 33

Compare PCs to data with lsfit()

I have a data frame with 2000 observations (rows) and 600 variables (columns). See reproducible example:

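cols <- list()  # named "cols" to avoid masking base::list()

for (i in 1:600) {
  # each variable is a random permutation of 2000 evenly spaced values in [0, 0.6]
  cols[[i]] <- sample(seq(0, 0.6, length.out = 2000))
}

df <- as.data.frame(do.call(cbind, cols))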

I want to perform PCA on the variables and then use lsfit() to compare the fit between the principal components and the data (as well as some other data, but this is left out here). My first issue is that when I perform PCA on the data set as it is, my principal components have length 2000. I would expect them to have length 600. However, this is resolved by transposing the data frame.

pc_model <- prcomp(df, center = FALSE, rank. = 3)
pcs <- pc_model$x # wrong length, why?


df_trans <- as.data.frame(t(df))
pc_model2 <- prcomp(df_trans, center = FALSE, rank. = 3)
pcs2 <- pc_model2$x # correct length, why?

My next issue is that when I try to use lsfit() to compare my 2000 observations to the principal components, I get all sorts of complaints:

fit <- lsfit(df_trans, pcs2) # Error in lsfit(df_trans, pcs2) : only 600 cases, but 2001 variables
fit2 <- lsfit(df, pcs2) # Error in complete.cases(x, y, wt) : not all arguments have the same length
fit3 <- lsfit(df[1,], pcs2[,1]) # Error in complete.cases(x, y, wt) : not all arguments have the same length

With the transposed data frame, lsfit() complains that I have too many variables. With the non-transposed data frame, it complains that the arguments don't have the same length, even when I only feed it one row from df (length=600) and one column from pcs2 (length=600). How do I get the least-squares fits between my PCs and my 2000 observations?

Upvotes: 0

Views: 59

Answers (1)

Abdessabour Mtk

Reputation: 3888

  1. pc_model$x holds the coordinates of the observations in the new space defined by the axes (PC1, PC2, PC3), so it has as many rows as there are observations, i.e. 2000 rows for 2000 observations.
  2. lsfit(X, Y) fits the model Y = Xb + e, where Y and e are (N,M) matrices, X is an (N,K) matrix and b is a (K,M) matrix. K is the number of variables used in the estimation: the number of columns in the original X matrix, plus 1 if the intercept is estimated (which is the default). The regression also requires N >= K to be computable.
    • Running fit2 <- lsfit(df, pcs) gives correct output because both conditions hold: df and pcs have the same number of rows, and N=2000 >= K=601.
    • The error Error in lsfit(df_trans, pcs2) : only 600 cases, but 2001 variables occurs because df_trans has 2000 columns (2001 variables once the intercept is added) but only 600 rows, the same as pcs2. Selecting the first 599 columns circumvents the error: lsfit(df_trans[,1:599], pcs2)
    • The error not all arguments have the same length comes from the complete.cases() call inside lsfit(): df and pcs2 have different numbers of rows, and complete.cases() throws before lsfit() reaches its own check on differing row counts.
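
To make the dimension bookkeeping above concrete, here is a minimal sketch. It assumes a scaled-down version of the question's data (200 observations and 60 variables instead of 2000 and 600, so it runs quickly) and an arbitrary seed; neither comes from the original post.

```r
set.seed(1)                 # assumed seed, only for reproducibility
n_obs <- 200                # assumption: scaled down from the question's 2000
n_var <- 60                 # assumption: scaled down from the question's 600

# Each column is a random permutation of n_obs evenly spaced values
df <- as.data.frame(replicate(n_var, sample(seq(0, 0.6, length.out = n_obs))))

pc_model <- prcomp(df, center = FALSE, rank. = 3)
pcs <- pc_model$x
pcs_dim <- dim(pcs)         # one row per observation, one column per PC

# x and y have the same number of rows, and N = n_obs >= K = n_var + 1
fit <- lsfit(as.matrix(df), pcs)
coef_dim <- dim(fit$coefficients)   # (n_var + 1) rows, one column per PC

# Without centering, the scores are exact linear combinations of the columns
# of df, so the residuals should be numerically zero
max_resid <- max(abs(fit$residuals))
```

With the original 2000 x 600 data the same call is lsfit(df, pcs), as in the first bullet; the transposed data frame is not needed for this fit.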

Upvotes: 0
