I'd like to use plyr to calculate multiple empirical cumulative distribution functions using ecdf(), and then apply those functions appropriately to entries in a data frame. For instance:
# Use the diamonds dataset in ggplot2
library(ggplot2)
library(plyr)
# Calculate an ecdf for each combination of cut and color
all_ecdfs <- dlply(diamonds, c("cut", "color"), function(x) ecdf(x$carat))
# Make a dataset of specific diamonds, which I want to compare to the larger set
# My particular subset of diamonds
my_diamonds <- ddply(diamonds, c("cut", "color"), summarise,
my.carat=runif(n=1, min=0.5, max=1))
If I were to do this manually, it would look something like this:
# Use the ecdf for the first entry: cut=="Fair" and color=="D"
my_diamonds$percentile <- NA
my_diamonds$percentile[my_diamonds$cut=="Fair" & my_diamonds$color=="D"] <-
all_ecdfs[["Fair.D"]](my_diamonds$my.carat[my_diamonds$cut=="Fair" & my_diamonds$color=="D"])
It seems like there should be some way to use ldply or lapply to do this automatically, but I can't figure it out.
Upvotes: 1
Here's how I would do it, using dplyr to make the ecdfs and vectorizing to get the values for your data.
#get ecdfs
library(dplyr)
z <- diamonds %>% group_by(cut, color) %>%
summarise(x = list(ecdf(carat)))
Now you have a data frame z with the functions stored as a list in column x.
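To make the list column concrete, here is a minimal sketch that pulls one stored function out of z and calls it. The name fair_d and the test value 0.7 are illustrative only, and `.groups = "drop"` assumes dplyr 1.0 or later.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# One ecdf per cut/color group, stored in list column x
z <- diamonds %>%
  group_by(cut, color) %>%
  summarise(x = list(ecdf(carat)), .groups = "drop")

# Each element of z$x is an ordinary R function.
# fair_d is a hypothetical name for the Fair/D group's ecdf.
fair_d <- z$x[[which(z$cut == "Fair" & z$color == "D")]]

class(fair_d)  # includes "ecdf" and "function"
fair_d(0.7)    # share of Fair/D diamonds with carat <= 0.7
```

Because the functions live in a list, they have to be extracted with [[ ]] before they can be called.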
Call the function on our data. We go by row, and get the matching cut and color, then call the function on carat:
z$x[z$cut == my_diamonds$cut & z$color == my_diamonds$color][[1]](my_diamonds$my.carat)
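One caveat: when z and my_diamonds have the same rows in the same order, the comparison above is elementwise, so the [[1]] appears to select only the first group's function and apply it to every carat value. A minimal sketch of a row-wise alternative is to join the functions onto the data and evaluate each one with mapply. The names result and percentile and the seeded my_diamonds are assumptions for illustration, not part of the original answer.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# One ecdf per cut/color group, stored in list column x
z <- diamonds %>%
  group_by(cut, color) %>%
  summarise(x = list(ecdf(carat)), .groups = "drop")

# Hypothetical comparison set: one random carat per cut/color group
set.seed(1)
my_diamonds <- diamonds %>%
  distinct(cut, color) %>%
  mutate(my.carat = runif(n(), min = 0.5, max = 1))

# Attach each row's own ecdf by joining on cut and color,
# then call every function on the carat value in the same row
result <- my_diamonds %>%
  left_join(z, by = c("cut", "color")) %>%
  mutate(percentile = mapply(function(f, v) f(v), x, my.carat)) %>%
  select(-x)
```

Joining on the grouping columns avoids relying on row order, so this should also work when my_diamonds covers only some groups or has several rows per group.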
Upvotes: 1