user7958

Reputation: 13

Build loop to use increasing part of dataframe in R as input to function

I'm using the first principal component from a PCA as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time, the model updates and produces a new forecast based on the newly included observation. Since PCA uses every observation passed to it, I also need to run the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise the PCA result could reveal information about the future and help the model produce a more accurate forecast than it otherwise would). I think a loop might be the solution, but I am struggling with how to write the code.

As a more specific example, suppose I have the following data frame:

data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")

> data
  V1 V2 V3
1  6 15 23
2  9 11 22
3  7 13 23
4  6 12 25
5  7 13 23

At each observation date, I wish to run a PCA (function prcomp() from the stats package) on all observations up to, and including, that observation. So I first want to run the PCA on the first two observations:

pca2 <- prcomp(data[1:2,], scale = TRUE)

Next, I want to run the PCA with the first, second and third observations as input:

pca3 <- prcomp(data[1:3,], scale = TRUE)

Next, I want to run the PCA with the first, second, third and fourth observations as input:

pca4 <- prcomp(data[1:4,], scale = TRUE)

and so on, until the last run of the PCA, which includes all observations in the data frame. From each of these "runs" of the PCA, I wish to extract the last value of the first principal component (PC1) (for pca2, I keep both the first and second values), and combine these into a final data frame, in which each monthly observation is the last PC1 value from the corresponding run.

The principal component outputs are:

> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
        PC1           PC2
1 -1.224745 -5.551115e-17
2  1.224745  5.551115e-17

> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
         PC1        PC2          PC3
1 -1.4172321 -0.2944338 6.106227e-16
2  1.8732448 -0.1215046 3.330669e-16
3 -0.4560127  0.4159384 4.163336e-16

> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
          PC1         PC2          PC3
1 -1.03030993 -1.10154914  0.015457199
2  2.00769890  0.07649216  0.011670433
3  0.03301806 -0.24226508 -0.033461874
4 -1.01040702  1.26732205  0.006334242

So I want my final output to be a data frame that looks like this:

> final.output
          PC1
1 -1.224745
2  1.224745
3 -0.4560127
4 -1.01040702

Comment: yes, it looks a bit odd with the first two values, but please don't pay too much attention to that. My point is that I wish to build a data frame that consists of the last calculated value of the first principal component from each of the PCA runs.

I am thinking that a for loop might be the best solution here, but I have not found any threads that bring me closer to a working solution. How can I make the loop use an increasing portion of the data frame in each calculation? Does anyone have any suggestions, tips or links? Any help on this is much appreciated!

Upvotes: 1

Views: 288

Answers (2)

Allan Cameron

Reputation: 173813

You can use a for loop. It's maybe not the most efficient solution, but it will work.

First, you create an empty list to store your results:

all_results <- list()

Next, you iterate from 2 to the number of rows of data with a loop. In each iteration, run prcomp on data[1:i,]. You can directly create your PCA data frame and extract PC1 from it as a vector. Then store that vector in the list at index i - 1:

for(i in 2:nrow(data))
{
  all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i,], scale = TRUE)$x)$PC1
}

To collect the results, you use lapply (list apply) to pull only the last element from each PC1 vector:

PC1 <- lapply(all_results, function(pca) pca[length(pca)] )

Now you convert these from a list of single elements to a vector:

PC1 <- do.call("c", PC1)

Finally, you want to stick the first value of the first analysis back on to the front of this vector:

PC1 <- c(all_results[[1]][1], PC1)
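
For comparison, the steps above can be condensed into a single pass. This is only a sketch of the same idea, not a different method: it assumes the example data from the question, and it spells the argument as scale. (the full name in prcomp; scale = TRUE works through partial matching). The names last_pc1 and first_pc1 are made up for illustration.

```r
# Example data from the question
data <- data.frame(V1 = c(6, 9, 7, 6, 7),
                   V2 = c(15, 11, 13, 12, 13),
                   V3 = c(23, 22, 23, 25, 23))

# For each i, run the PCA on rows 1..i and keep the PC1 score of row i
last_pc1 <- vapply(2:nrow(data), function(i) {
  unname(prcomp(data[1:i, ], scale. = TRUE)$x[i, "PC1"])
}, numeric(1))

# Prepend the first-row score from the two-row PCA, as in the question
first_pc1 <- unname(prcomp(data[1:2, ], scale. = TRUE)$x[1, "PC1"])
PC1 <- c(first_pc1, last_pc1)
```

vapply returns a plain numeric vector directly, so the separate lapply and do.call steps are not needed. Each PCA is still refitted from scratch, so this is tidier rather than faster.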

Upvotes: 0

Edward

Reputation: 18683

I had a very similar approach.

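PCA <- vector("list", length = nrow(data) - 1)
for (i in 1:(nrow(data) - 1)) {
  # First run (rows 1:2): keep both scores; later runs: keep only the last row
  if (i == 1) j <- 1:2 else j <- i + 1
  # Run the PCA on rows 1..(i + 1) and take the selected PC1 scores (column 1)
  PCA[[i]] <- as.data.frame(prcomp(data[1:(1 + i), ], scale = TRUE)$x)[j, 1]
}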

unlist(PCA)
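
If you need this for more than one data set, the same logic could be wrapped in a small function that returns a data frame shaped like the final.output in the question. This is a rough sketch: the function name expanding_pc1 and its start argument are hypothetical, and it assumes every window has at least two rows and no constant columns (prcomp cannot rescale a constant column).

```r
# Hypothetical helper: PC1 scores from PCAs on growing windows of rows
expanding_pc1 <- function(df, start = 2) {
  stopifnot(start >= 2, start <= nrow(df))
  scores <- lapply(start:nrow(df), function(i) {
    pc1 <- prcomp(df[1:i, ], scale. = TRUE)$x[, "PC1"]
    # Keep every score from the first window, only the last from later ones
    if (i == start) pc1 else pc1[i]
  })
  data.frame(PC1 = unlist(scores, use.names = FALSE))
}

# Example data from the question
data <- data.frame(V1 = c(6, 9, 7, 6, 7),
                   V2 = c(15, 11, 13, 12, 13),
                   V3 = c(23, 22, 23, 25, 23))

final.output <- expanding_pc1(data)
```

With the five-row example this gives one row per observation: two from the first PCA and one from each later run.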

Upvotes: 2
