I have a huge dataset with 750,000 IDs, for which I want to aggregate monthly values to yearly values by multiplying all values for a given ID. The ID consists of a combination of an identification number and a year. The data I want to extract: ID monthly value 1 - 1997 Product of Monthly Values in Year 1997 1 - 1998 Product of Monthly Values in Year 1998 1 - 1999 Product of Monthly Values in Year 1999 ... ... 2 - 1997 Product of Monthly Values in Year 1997 2 - 1998 Product of Monthly Values in Year 1998 2 - 1999 Product of Monthly Values in Year 1999 ... ... The dataset which is the source: ID monthly value 1 - 1997 Monthly Value 1 in Year 1997 1 - 1997 Monthly Value 2 in Year 1997 1 - 1997 Monthly Value 3 in Year 1997 ... ... 2 - 1997 Monthly Value 1 in Year 1997 2 - 1997 Monthly Value 2 in Year 1997 2 - 1997 Monthly Value 3 in Year 1997 ... ... I have written a for loop, which takes about 0.74s for 10 IDs, which is way to slow. It would take about 15 hours for the whole data to run through. The for loop multiplies all monthly values for a given ID and stores it in a separate data frame. for (i in 1:nrow(yearlyreturns)){ yearlyreturns[i, "yret"] <- prod(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) - 1 yearlyreturns[i, "monthcount"] <- length(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) } I don't know how to get from here to a vectorised function, which takes less time. Is this possible to do in R?

Reputation: 25

How can I use a vectorised function to multiply all values in a different data frame for a given ID in R?

I have a huge dataset with 750,000 IDs, for which I want to aggregate monthly values to yearly values by multiplying all values for a given ID. The ID consists of a combination of an identification number and a year.

The data I want to extract:

ID	monthly value
1 - 1997	Product of Monthly Values in Year 1997
1 - 1998	Product of Monthly Values in Year 1998
1 - 1999	Product of Monthly Values in Year 1999
...	...
2 - 1997	Product of Monthly Values in Year 1997
2 - 1998	Product of Monthly Values in Year 1998
2 - 1999	Product of Monthly Values in Year 1999
...	...

The dataset which is the source:

ID	monthly value
1 - 1997	Monthly Value 1 in Year 1997
1 - 1997	Monthly Value 2 in Year 1997
1 - 1997	Monthly Value 3 in Year 1997
...	...
2 - 1997	Monthly Value 1 in Year 1997
2 - 1997	Monthly Value 2 in Year 1997
2 - 1997	Monthly Value 3 in Year 1997
...	...

I have written a for loop, which takes about 0.74s for 10 IDs, which is way to slow. It would take about 15 hours for the whole data to run through. The for loop multiplies all monthly values for a given ID and stores it in a separate data frame.

for (i in 1:nrow(yearlyreturns)){
  
  yearlyreturns[i, "yret"] <- prod(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) - 1
  yearlyreturns[i, "monthcount"] <- length(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"])
  
}

I don't know how to get from here to a vectorised function, which takes less time.

Is this possible to do in R?

Upvotes: 1

Answers (3)

Norbs

Reputation: 11

In addition to the most excellent previous answers - here's a link to an earlier post comparing 10 common ways to calculate means by group. Data.table based solutions are definitely the way to go - especially for datasets with millions of rows. Unless you're writing to individual output files - I'm not sure why this would take hours rather than minutes.

Upvotes: 1

akrun

Reputation: 887183

Based on the for loop code, this may be a done with a join

library(data.table)
setDT(yearlyreturns)[monthlyreturns, c("yret", "monthcount") 
     := .(prod(change) -1, .N), on = .(ID), by = .EACHI]

Upvotes: 1

TarJae

Reputation: 78937

Something like this:

library(dplyr)

df %>% 
  mutate(monthly_value = paste("Product of", str_replace(monthly_value, 'Value\\s\\d', 'Values'))) %>% 
  group_by(ID, monthly_value) %>% 
  summarise()

  ID       monthly_value                         
  <chr>    <chr>                                 
1 1 - 1997 Product of Monthly Values in Year 1997
2 2 - 1997 Product of Monthly Values in Year 1997

data:

structure(list(ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997", 
"2 - 1997", "2 - 1997"), monthly_value = c("Monthly Value 1 in Year 1997", 
"Monthly Value 2 in Year 1997", "Monthly Value 3 in Year 1997", 
"Monthly Value 1 in Year 1997", "Monthly Value 2 in Year 1997", 
"Monthly Value 3 in Year 1997")), class = "data.frame", row.names = c(NA, 
-6L))

Upvotes: 1

How can I use a vectorised function to multiply all values in a different data frame for a given ID in R?

Answers (3)

Related Questions