Reputation: 49
I am working with a rather large document term matrix (~280,000 terms) in R, and am wondering if there is an efficient way to create lag variables for each of my original terms.
The following example gives a document term matrix with three terms. The loop works for a toy example like this, but it needs one hand-written line per term, so it would be impractical for my data.
A quick note on the lag structure: I am exploring whether the appearance of any given term may have some cumulative, though diminishing, amount of importance over time.
dtm <- data.frame(revenue=c(1,2,3,3,5,6), up=c(1,1,0,3,1,1), sale=c(0,1,1,0,1,1))
for (i in 1:nrow(dtm)){
  if (i >= 4){
    dtm$revenueLag4days[i] <- dtm$revenue[(i-3):i] %*% c(0.25,0.5,0.75,1)
    dtm$upLag4days[i] <- dtm$up[(i-3):i] %*% c(0.25,0.5,0.75,1)
    dtm$saleLag4days[i] <- dtm$sale[(i-3):i] %*% c(0.25,0.5,0.75,1)
  } else
    dtm$revenueLag4days[i] <- dtm$upLag4days[i] <- dtm$saleLag4days[i] <- NA
}
Is it possible to rewrite this in a functional way for a document term matrix (~280,000 terms)?
Upvotes: 1
Views: 88
Reputation: 3055
The use of the if statement and creation of vectors within the loop slows you down a fair bit. The loop below will be faster, and you could speed it up further by using parallel processing (e.g. foreach).
# Create a data.frame to store your results in
ans <- data.frame(matrix(NA, nrow = nrow(dtm), ncol = ncol(dtm)))
# Give it the same column names as dtm
colnames(ans) <- colnames(dtm)
# Transpose dtm for matrix math
tdtm <- t(dtm)
# Create the column vector of weights (oldest day first)
mult_mat <- matrix(c(0.25, 0.5, 0.75, 1), ncol = 1, nrow = 4)
# Loop through your matrix
for(i in 4:nrow(dtm)){
  ans[i,] <- tdtm[,(i-3):i] %*% mult_mat
}
This loop took ~52 seconds for a matrix with 280,000 columns and 100 rows.
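If the row-by-row loop is still too slow, one possible alternative is to loop over the four weights instead of over the rows: add up four shifted copies of the whole matrix, each scaled by its weight. This is only a sketch on the toy data from the question. The names w, m, acc and lagged are mine, the "Lag4days" suffix just mirrors the column names in the question, and it assumes there are at least as many rows as weights.

```r
# Toy data from the question
dtm <- data.frame(revenue = c(1, 2, 3, 3, 5, 6),
                  up      = c(1, 1, 0, 3, 1, 1),
                  sale    = c(0, 1, 1, 0, 1, 1))

# Weights, oldest day first, as in the question
w <- c(0.25, 0.5, 0.75, 1)

m <- as.matrix(dtm)
n <- nrow(m)
k <- length(w)   # window length (assumes n >= k)

# Result matrix; the first k - 1 rows stay NA because they lack a full window
lagged <- matrix(NA_real_, nrow = n, ncol = ncol(m),
                 dimnames = list(NULL, paste0(colnames(m), "Lag4days")))

# Sum k shifted copies of the matrix, each scaled by its weight.
# Row r of acc corresponds to row r + k - 1 of the original matrix.
acc <- matrix(0, nrow = n - k + 1, ncol = ncol(m))
for (j in seq_len(k)) {
  acc <- acc + w[j] * m[j:(n - k + j), , drop = FALSE]
}
lagged[k:n, ] <- acc

lagged
```

The loop here runs only four times regardless of how many rows or terms there are, and each pass is a single vectorized matrix operation. For 280,000 terms a dense matrix may use a lot of memory; the same shifted-sum idea should carry over to a sparse matrix (for example from the Matrix package), but I have not tested that.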
Upvotes: 0