Reputation: 13
I'm using the first principal component from a PCA analysis as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time, the model updates and produces a new forecast based on the new observation included into the model. Since PCA uses data from all observations included in the model for its calculations, I need to run also the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise, the PCA-result could reveal information about the future, and help the model produce a more accurate answer than it would have otherwise). I think a loop might be the solution, but I am struggling with how to formulate the code.
As a more specific example, consider if I have the following data.frame
data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")
> data
V1 V2 V3
1 6 15 23
2 9 11 22
3 7 13 23
4 6 12 25
5 7 13 23
At each observation date, I wish to run a PCA (function prcomp()
from the stats-package) for all observations up to, and including, that observation. So I want to first run PCA for the two first observation
pca2 <- prcomp(data[1:2,], scale = TRUE)
next I want to run PCA with the first, second and third observation as input
pca3 <- prcomp(data[1:3,], scale = TRUE)
next I want to run PCA with the first, second, third and fourth observation as input
pca4 <- prcomp(data[1:4,], scale = TRUE)
and so on, until the last run of the PCA, which includes all observations in the dataframe. For each of these "runs" of the PCA, I wish to extract the last value (though for pca2, I use both the first and second value) of the first principal component (PC1), and merge these into a final dataframe, where each monthly observation is the last value of the first principal component of PCA results for each of the runs.
The principal component outputs are:
> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
PC1 PC2
1 -1.224745 -5.551115e-17
2 1.224745 5.551115e-17
> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
PC1 PC2 PC3
1 -1.4172321 -0.2944338 6.106227e-16
2 1.8732448 -0.1215046 3.330669e-16
3 -0.4560127 0.4159384 4.163336e-16
> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
PC1 PC2 PC3
1 -1.03030993 -1.10154914 0.015457199
2 2.00769890 0.07649216 0.011670433
3 0.03301806 -0.24226508 -0.033461874
4 -1.01040702 1.26732205 0.006334242
So I want my final output
to be a dataframe to look like
>final.output
PC1
1 -1.224745
2 1.224745
3 -0.4560127
4 -1.01040702
Comment: yes, it looks a bit weird with the two first values, but please don't pay too much attention to that. My point is that I wish to build a dataframe that consists of the last calculated value for the first principal component for each of the PCA runs.
I am thinking that a for.loop might be the best solution here, but I have not been successful in finding any threads that might guide me closer to a coding solution. How can I make the loop use an increasing amount of the dataframe in the calculations? Does anyone have any suggestions/tips/links? Any help on this is much appreciated!
Upvotes: 1
Views: 288
Reputation: 173813
You can use a for loop. It's maybe not the most efficient solution, but it will work.
First, you create an empty list to store your results:
all_results <- list()
Next, you iterate from 2 to the number of rows of data
with a loop. For each iteration of the loop, run prcomp
on data[1:i,]
. You can directly create your pca data frame and extract PC1
from it as a vector. Now you store it in the list at index i - 1
for(i in 2:nrow(data))
{
all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i,], scale = TRUE)$x)$PC1
}
Now to extract all the results, you use lapply
(list apply) to extract only the last element from each PC1 vector:
PC1 <- lapply(all_results, function(pca) pca[length(pca)] )
Now you convert these from a list of single elements to a vector:
PC1 <- do.call("c", PC1)
Finally, you want to stick the first value of the first analysis back on to the front of this vector:
PC1 <- c(all_results[[1]][1], PC1)
Upvotes: 0
Reputation: 18683
I had a very similar approach.
PCA <- vector("list", length=nrow(data)-1)
for(i in 1:(nrow(data)-1)) {
if(i==1) j <- 1:2 else j<-i+1
PCA[[i]] <- as.data.frame(prcomp(data[1:(1+i),], scale = TRUE)$x)[j, 1]
}
unlist(PCA)
Upvotes: 2