Reputation: 60
I have a simple Table:
ID|Value
1|10
1|20
1|-5
2|25
3|2
3|15
4|8
5|18
6|33
6|5
6|50
Actual I use this code:
for (row in 1:nrow(Table)) {
ID <- Table[row, 1]
Value <- Table[row, 2]
if ( oldID == ID) {
currentValue <- currentValue * ((100 - Value)/100) }
else {
addrow <- data.frame(oldID, currentValue)
PriceRR <- rbind(PriceRR, addrow)
oldID <- ID
currentValue <- 100 - Value
}
}
To allocated a discount for a later DAX Value in Power BI.
But it slow as hell. So I want to parallelize it.
daply might do the work. But I do not know the inner workings of it.
So basically what I need.
Split table in sets by group of ID.
Set1 1,10 1,20 1,5
Set2 2,25
Set3 3,2 3,15
.
.
.
Apply function to Sets parallel.
First call of function in set, initialize currentValue <- 100
after
currentValue <- currentValue * ((100 - Value)/100)
For Set1.1 90 <- 100 * ((100 - 10)/100)
For Set1.2 72 <- 90 * ((100 - 20)/100)
For Set1.3 68,4 <- 72 * ((100 - 5)/100)
It should return ID=1 Value=68,4
I need to know, is it possible to make a variable persistent in memory for the duration of execute a function an set, as long it lives?
Will daply or a other function create a new working thread to apply it on a set?
I am a R beginner and must jump right in the inner working of the R environment. :-)
Sven
Upvotes: 3
Views: 365
Reputation: 24079
Your original script is slow for a couple of reason. First you are looping through every element in your initial table and not taking advantage of the vectorized nature of R. Second, there is a rbind
function within the loop. Binding is a slow process, especially as the object size grows.
It looks likes the objective is a cumulative product of the the value column grouped by the ID column.
Here is a base R solution using the split, apply and merge strategy.
Table <-structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, -5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
#Create column for the ((100 - Value)/100) factor
Table$factor<- ((100 - Table$Value)/100)
#split by ID
dfs<-split(Table, Table$ID)
currentValue<-sapply(dfs, function(x){
#find the cumulative product of the factor column
product<-cumprod(x$factor)
#return the last value fron the cumprod
return(100*product[length(product)])
})
#create the final answer
PriceRR<-data.frame(oldID=as.integer(names(dfs)), currentValue)
PriceRR
oldID currentValue
1 1 75.600
2 2 75.000
3 3 83.300
4 4 92.000
5 5 82.000
6 6 31.825
This script is using the cumprod
function which is vectorized, thus very fast. Also the above script avoids the slow operation of continuing to growing the final dataframe.
Upvotes: 1
Reputation: 887028
An option with reduce
from purrr
library(dplyr)
library(purrr)
data %>%
group_by(ID) %>%
summarise(Result = reduce(Value, ~ .x * (100 -.y)/100, .init = 100))
# A tibble: 6 x 2
# ID Result
#* <int> <dbl>
#1 1 68.4
#2 2 75
#3 3 83.3
#4 4 92
#5 5 82
data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
Upvotes: 1
Reputation: 24790
Here's an approach with dplyr
and Reduce
from base R:
library(dplyr)
data %>%
group_by(ID) %>%
summarize(Result = Reduce(function(x,y) x * ((100 - y)/ 100),
Value, init = 100))
# A tibble: 6 x 2
ID Result
<int> <dbl>
1 1 68.4
2 2 75
3 3 83.3
4 4 92
5 5 82
6 6 31.8
Reduce
is a tricky function mostly because the documentation is terrible. Reduce
applies a function with two arguments to elements in a vector in succession with the previous value as the first argument and the current value as the second argument. You can set an initial value with init =
.
I notice in your explaination that your expected output for group 1 is 68.4
. This is only true if the value for row 3 is 5
rather than the -5
you posted. Since this was the only negative value in your data, I went ahead and changed it to 5
.
Data
data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
Upvotes: 1