Reputation: 25
I have a huge dataset with 750,000 IDs, for which I want to aggregate monthly values to yearly values by multiplying all values for a given ID. The ID consists of a combination of an identification number and a year.
The data I want to extract:
ID | monthly value |
---|---|
1 - 1997 | Product of Monthly Values in Year 1997 |
1 - 1998 | Product of Monthly Values in Year 1998 |
1 - 1999 | Product of Monthly Values in Year 1999 |
... | ... |
2 - 1997 | Product of Monthly Values in Year 1997 |
2 - 1998 | Product of Monthly Values in Year 1998 |
2 - 1999 | Product of Monthly Values in Year 1999 |
... | ... |
The dataset which is the source:
ID | monthly value |
---|---|
1 - 1997 | Monthly Value 1 in Year 1997 |
1 - 1997 | Monthly Value 2 in Year 1997 |
1 - 1997 | Monthly Value 3 in Year 1997 |
... | ... |
2 - 1997 | Monthly Value 1 in Year 1997 |
2 - 1997 | Monthly Value 2 in Year 1997 |
2 - 1997 | Monthly Value 3 in Year 1997 |
... | ... |
I have written a for loop, which takes about 0.74s for 10 IDs, which is way to slow. It would take about 15 hours for the whole data to run through. The for loop multiplies all monthly values for a given ID and stores it in a separate data frame.
for (i in 1:nrow(yearlyreturns)){
yearlyreturns[i, "yret"] <- prod(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) - 1
yearlyreturns[i, "monthcount"] <- length(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"])
}
I don't know how to get from here to a vectorised function, which takes less time.
Is this possible to do in R?
Upvotes: 1
Views: 264
Reputation: 11
In addition to the most excellent previous answers - here's a link to an earlier post comparing 10 common ways to calculate means by group. Data.table based solutions are definitely the way to go - especially for datasets with millions of rows. Unless you're writing to individual output files - I'm not sure why this would take hours rather than minutes.
Upvotes: 1
Reputation: 887183
Based on the for
loop code, this may be a done with a join
library(data.table)
setDT(yearlyreturns)[monthlyreturns, c("yret", "monthcount")
:= .(prod(change) -1, .N), on = .(ID), by = .EACHI]
Upvotes: 1
Reputation: 78937
Something like this:
library(dplyr)
df %>%
mutate(monthly_value = paste("Product of", str_replace(monthly_value, 'Value\\s\\d', 'Values'))) %>%
group_by(ID, monthly_value) %>%
summarise()
ID monthly_value
<chr> <chr>
1 1 - 1997 Product of Monthly Values in Year 1997
2 2 - 1997 Product of Monthly Values in Year 1997
data:
structure(list(ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997",
"2 - 1997", "2 - 1997"), monthly_value = c("Monthly Value 1 in Year 1997",
"Monthly Value 2 in Year 1997", "Monthly Value 3 in Year 1997",
"Monthly Value 1 in Year 1997", "Monthly Value 2 in Year 1997",
"Monthly Value 3 in Year 1997")), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 1