Reputation: 33
I would like to know if and how I could make my code more efficient by using vectorized functions instead of `for` loops.
I am working on a dataset with around 1.6 million observations. I want to adjust the prices for inflation so I need to match the month of the observation with the month of the corresponding CPI index. I have a main data frame (the one with 1.6 million observations) and a data frame with the CPI index I need (this only has 12 observations, one for each month in the year my analysis is taking place).
Here is how I tried to "match" each observation with its corresponding CPI index:
`for (i in 1:nrow(large.data.frame)) {
  for (j in 1:nrow(CPI)) {
    if (months(large.data.frame[i, "Date"]) == months(CPI[j, "Date"])) {
      CPImatch[i] <- CPI[j, 2]
    } else next
  }
}`
NOTE: CPImatch is a separate data frame I was going to use to hold the matched values and then cbind with my initial data frame. I also know there is probably a better way to do this...
Since my code is still running, I know that this is an incredibly inefficient (and maybe even wrong) way of doing what I want to do. Is there a way of vectorizing this loop, possibly with a function from the `apply` family?
Any feedback is greatly appreciated!
Upvotes: 2
Views: 99
Reputation: 9923
Your code can certainly be made much faster. One simple step would be to pre-calculate the months once rather than recalculating them on every iteration. Vectorisation will make it even faster. I think the following code should work, mapping the months to CPI, though it is difficult to test without some test data.
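`library(plyr)
# mapvalues() keeps the type of its input (character month names), so convert the result back to numeric
CPImatch <- as.numeric(mapvalues(months(large.data.frame$Date), from = months(CPI$Date), to = CPI[,2]))`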
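If you would rather not depend on plyr, the same lookup can probably be done in base R with match(). The following is only a sketch on small made-up data: the Price and Index column names and all the values are my assumptions, not your real data. I have also swapped months() for format(Date, "%m") so the comparison does not depend on the locale's month names.

```r
# Small made-up data standing in for the real data frames (column names assumed)
large.data.frame <- data.frame(
  Date  = as.Date(c("2012-01-15", "2012-03-02", "2012-01-30", "2012-12-25")),
  Price = c(10, 20, 30, 40)
)
CPI <- data.frame(
  Date  = seq(as.Date("2012-01-01"), by = "month", length.out = 12),
  Index = 100 + 0:11   # hypothetical index values, not real CPI figures
)

# Compute the month of every date once, as a "01".."12" string
obs.month <- format(large.data.frame$Date, "%m")
cpi.month <- format(CPI$Date, "%m")

# match() gives, for each observation, the row of CPI that has the same month
idx <- match(obs.month, cpi.month)

# Index the CPI column with those row numbers: one CPI value per observation
large.data.frame$CPImatch <- CPI$Index[idx]
```

Because match() and the indexing are both vectorised, this makes one pass over the 1.6 million rows instead of 1.6 million x 12 comparisons. Any observation whose month is missing from CPI ends up as NA in CPImatch.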
Upvotes: 1