Reputation: 5401
I'm trying to calculate the range minimum of a dataframe in R. The dataframe looks like this:
+-----+--------------+-----------+------+------+
| Key | DaysToEvent | PriceEUR | Pmin | Pmax |
+-----+--------------+-----------+------+------+
| AAA | 120 | 50 | 50 | 50 |
| AAA | 110 | 40 | 40 | 50 |
| AAA | 100 | 60 | 40 | 60 |
| BBB | ... | | | |
+-----+--------------+-----------+------+------+
So the range minimum price (Pmin
) holds the minimum price of that key, up to that point in time ( DaysToEvent
).
Here's my implementation:
for (i in 1:nrow(data)){
currentRecord <- data[i,]
if(currentRecord$Key != currentKey) {
# New key detected - reset pmin and pmax
pmin <- 100000
pmax <- 0
currentKey <- currentRecord$Key
}
if(currentRecord$PriceEUR < pmin) {
pmin <- currentRecord$PriceEUR
}
if(currentRecord$PriceEUR > pmax) {
pmax <- currentRecord$PriceEUR
}
currentRecord$Pmin <- pmin
currentRecord$Pmax <- pmax
# This line seems to be killing my performance
# but otherwise the data variable is not updated in
# global space
data[i,] <- currentRecord
}
This works - but is REALLY slow, only a couple per second. It works because I've sorted the data frame like so data = data[order(data$Key, -data$DaysToEvent), ]
. Reason for doing this, is because I was hoping to get a Big-O of nlog(n)
for the sorting and n
on the for loop. So I thought I'd be flying through this data, but I'm not AT ALL - takes hours.
How can I make this faster?
The previous approach is from my colleague - here in pseudo:
for (i in 1:nrow(data)) {
...
currentRecord$Pmin <- data[subset on the key[find the min value of the price
where DaysToEvent > currentRecord$DaysToEvent]]
...
}
Also works - but I think the order of this functions is way higher. n^2log(n)
if I'm correct and takes days. So I thought I was going to improve on that big time.
So I've tried to get my head around on all kinds of *apply
, by
functions and of course that's what you really want to use.
However - if I use by()
and then split on the key. Gets me pretty close. However, I cannot get around how I would get the range minimum / max.
I'm trying to think in functional paradigm but I'm stuck. Any help is appreciated.
Upvotes: 4
Views: 114
Reputation: 13304
[Original answer: dplyr]
You can solve this problem by using the dplyr
package:
library(dplyr)
d %>%
group_by(Key) %>%
mutate(Pmin=cummin(PriceEUR),Pmax=cummax(PriceEUR))
# Key DaysToEvent PriceEUR Pmin Pmax
# 1 AAA 120 50 50 50
# 2 AAA 110 40 40 50
# 3 AAA 100 60 40 60
# 4 BBB 100 50 50 50
where d
is supposed to be your data set:
d <- data.frame(Key=c('AAA','AAA','AAA','BBB'),DaysToEvent = c(120,110,100,100),PriceEUR = c(50,40,60,50), Pmin = c(50,40,40,30), Pmax = c(50,50,60,70))
[Update: data.table]
Another approach is to use data.table
, which has a quite spectacular performance:
library(data.table)
DT <- setDT(d)
DT[,c("Pmin","Pmax") := list(cummin(PriceEUR),cummax(PriceEUR)),by=Key]
DT
# Key DaysToEvent PriceEUR Pmin Pmax
# 1: AAA 120 50 50 50
# 2: AAA 110 40 40 50
# 3: AAA 100 60 40 60
# 4: BBB 100 50 50 50
[Update 2: base R]
Here is another approach in the case that you'd like to use only base R for some reason:
d$Pmin <- unlist(lapply(split(d$PriceEUR,d$Key),cummin))
d$Pmax <- unlist(lapply(split(d$PriceEUR,d$Key),cummax))
Upvotes: 5