Sven Ungerecht
Sven Ungerecht

Reputation: 60

R split in Group for parallel execution

I have a simple Table:

ID|Value
1|10
1|20
1|-5
2|25
3|2
3|15
4|8
5|18
6|33
6|5
6|50

Actual I use this code:

for (row in 1:nrow(Table)) {
   ID <- Table[row, 1]
   Value <- Table[row, 2]
   if ( oldID == ID) {
      currentValue <- currentValue * ((100 - Value)/100) }
   else {
      addrow <- data.frame(oldID, currentValue)
      PriceRR <- rbind(PriceRR, addrow)
      oldID <- ID
      currentValue <- 100 - Value
      }
   }

To allocated a discount for a later DAX Value in Power BI.

But it slow as hell. So I want to parallelize it.

daply might do the work. But I do not know the inner workings of it.

So basically what I need.

Split table in sets by group of ID.

Set1 1,10 1,20 1,5
Set2 2,25
Set3 3,2 3,15
.
.
.

Apply function to Sets parallel.

First call of function in set, initialize currentValue <- 100

after

currentValue <- currentValue * ((100 - Value)/100)
For Set1.1 90 <- 100 * ((100 - 10)/100)
For Set1.2 72 <- 90 * ((100 - 20)/100)
For Set1.3 68,4 <- 72 * ((100 - 5)/100)
It should return ID=1 Value=68,4

I need to know, is it possible to make a variable persistent in memory for the duration of execute a function an set, as long it lives?

Will daply or a other function create a new working thread to apply it on a set?

I am a R beginner and must jump right in the inner working of the R environment. :-)

Sven

Upvotes: 3

Views: 365

Answers (3)

Dave2e
Dave2e

Reputation: 24079

Your original script is slow for a couple of reason. First you are looping through every element in your initial table and not taking advantage of the vectorized nature of R. Second, there is a rbind function within the loop. Binding is a slow process, especially as the object size grows.

It looks likes the objective is a cumulative product of the the value column grouped by the ID column.

Here is a base R solution using the split, apply and merge strategy.

Table <-structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 
6L), Value = c(10L, 20L, -5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L, 
50L)), class = "data.frame", row.names = c(NA, -11L))

#Create column for the ((100 - Value)/100) factor
Table$factor<- ((100 - Table$Value)/100)

#split by ID
dfs<-split(Table, Table$ID)

currentValue<-sapply(dfs, function(x){
  #find the cumulative product of the factor column
  product<-cumprod(x$factor)
  #return the last value fron the cumprod
  return(100*product[length(product)])
})

#create the final answer
PriceRR<-data.frame(oldID=as.integer(names(dfs)), currentValue)
PriceRR

  oldID currentValue
1     1       75.600
2     2       75.000
3     3       83.300
4     4       92.000
5     5       82.000
6     6       31.825

This script is using the cumprod function which is vectorized, thus very fast. Also the above script avoids the slow operation of continuing to growing the final dataframe.

Upvotes: 1

akrun
akrun

Reputation: 887028

An option with reduce from purrr

library(dplyr)
library(purrr)
data %>%
    group_by(ID) %>% 
    summarise(Result = reduce(Value, ~ .x * (100 -.y)/100, .init = 100))
# A tibble: 6 x 2
#     ID Result
#* <int>  <dbl>
#1     1   68.4
#2     2   75  
#3     3   83.3
#4     4   92  
#5     5   82  

data

data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L, 
50L)), class = "data.frame", row.names = c(NA, -11L))

Upvotes: 1

Ian Campbell
Ian Campbell

Reputation: 24790

Here's an approach with dplyr and Reduce from base R:

library(dplyr)
data %>%
  group_by(ID) %>%
  summarize(Result = Reduce(function(x,y) x * ((100 - y)/ 100),
                            Value, init = 100))
# A tibble: 6 x 2
     ID Result
  <int>  <dbl>
1     1   68.4
2     2   75  
3     3   83.3
4     4   92  
5     5   82  
6     6   31.8

Reduce is a tricky function mostly because the documentation is terrible. Reduce applies a function with two arguments to elements in a vector in succession with the previous value as the first argument and the current value as the second argument. You can set an initial value with init =.

I notice in your explaination that your expected output for group 1 is 68.4. This is only true if the value for row 3 is 5 rather than the -5 you posted. Since this was the only negative value in your data, I went ahead and changed it to 5.

Data

data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L, 
50L)), class = "data.frame", row.names = c(NA, -11L))

Upvotes: 1

Related Questions