Find the best curve to fit a family of curves using R

Question

I have a process that generates a set of numbers (each < 1) on every run. The process runs until the cumulative sum of the generated numbers equals 1, so each set may contain a different count of numbers, but every set sums to 1.

There are thousands of runs of the process. When I plot the cumulative sum for each run, I get multiple curves, one per run.

For 50 runs:

For 2000 runs:

As you can see, the curves have a definite shape and the output is not random. I want to find the best-fit equation for this group of curves.

How can I do this in R? Most best-fit curve solutions I have found fit a single set of data.

Here is the code to generate sample data with 5 runs.

run_group <- c('A_group', 'A_group', 'A_group', 'A_group', 'A_group', 'A_group', 'A_group', 'A_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'B_group', 'C_group', 'C_group', 'C_group', 'C_group', 'C_group', 'C_group', 'C_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'D_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group', 'E_group')

cumul <- c(0.052631579, 0.263157895, 0.342105263, 0.710526316, 0.868421053, 0.894736842, 0.973684211, 1, 0.0078125, 0.015625, 0.0390625, 0.0546875, 0.0703125, 0.1015625, 0.1640625, 0.3203125, 0.4921875, 0.734375, 0.875, 0.96875, 0.9921875, 1, 0.073529412, 0.220588235, 0.323529412, 0.507352941, 0.727941176, 0.970588235, 1, 0.006134969, 0.055214724, 0.141104294, 0.190184049, 0.349693252, 0.595092025, 0.858895706, 0.969325153, 1, 0.005649718, 0.011299435, 0.016949153, 0.039548023, 0.073446328, 0.124293785, 0.299435028, 0.451977401, 0.559322034, 0.728813559, 0.81920904, 0.960451977, 1)

time_diff_to_complete <- c(-155, -140, -125, -110, -95, -80, -65, -50, -270, -210, -195, -180, -165, -150, -135, -120, -105, -90, -75, -60, -45, -30, -130, -115, -100, -85, -70, -55, -40, -175, -160, -130, -115, -100, -85, -70, -55, -40, -225, -210, -195, -180, -150, -135, -120, -105, -90, -75, -60, -45, -30)

sample_data <- data.frame(run_group, cumul, time_diff_to_complete, stringsAsFactors=FALSE)

G. Grothendieck · Accepted Answer

Just stack the runs into one data set and fit a single curve to all the points. The curves look like Gaussian CDFs, so we fit pnorm by nonlinear least squares with nls. (The logistic CDF, plogis, would likely also work.)

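x <- sample_data$time_diff_to_complete
o <- order(x)  # sort by time so lines() draws left to right

# starting values: location and spread of the time axis
st <- list(a = mean(x), b = sd(x))

# pool all runs and fit one Gaussian CDF by nonlinear least squares
fm <- nls(cumul ~ pnorm(time_diff_to_complete, a, b), sample_data[o, ], start = st)

plot(cumul ~ time_diff_to_complete, sample_data)
lines(fitted(fm) ~ time_diff_to_complete, sample_data[o, ])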

The fit looks like this:
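
As a hedged follow-up, the sketch below reads the fitted parameters off a pnorm fit, which gives the equation cumul = pnorm((t - a) / b), and compares that fit with the logistic alternative mentioned above. The names fm_norm, fm_logis, rss and grid are illustrative only. The simulated stand-in data is an assumption, used only when sample_data from the question is not already defined, so that the sketch runs on its own.

```r
# Stand-in data (an assumption): built only if the question's sample_data
# is not already in the workspace.
if (!exists("sample_data")) {
  set.seed(1)
  sample_data <- do.call(rbind, lapply(1:5, function(i) {
    tm <- seq(-200, -30, by = 15)
    data.frame(run_group = paste0("run_", i),
               time_diff_to_complete = tm,
               # each simulated run gets a slightly shifted midpoint
               cumul = pnorm(tm, mean = rnorm(1, -100, 10), sd = 35),
               stringsAsFactors = FALSE)
  }))
}

x <- sample_data$time_diff_to_complete

# Gaussian CDF fit, as in the answer.
fm_norm <- nls(cumul ~ pnorm(time_diff_to_complete, a, b), sample_data,
               start = list(a = mean(x), b = sd(x)))

# Logistic CDF fit. A logistic scale s has standard deviation
# s * pi / sqrt(3), so sd(x) is rescaled for a comparable starting value.
fm_logis <- nls(cumul ~ plogis(time_diff_to_complete, a, b), sample_data,
                start = list(a = mean(x), b = sd(x) * sqrt(3) / pi))

# Residual sums of squares; the smaller value is the closer fit.
rss <- c(pnorm = deviance(fm_norm), plogis = deviance(fm_logis))
print(rss)

# Parameters of the fitted equation cumul = pnorm((t - a) / b).
print(coef(fm_norm))

# Evaluate the fitted curve on an even grid to draw a smooth line.
grid <- data.frame(time_diff_to_complete = seq(min(x), max(x), length.out = 200))
grid$cumul_hat <- predict(fm_norm, newdata = grid)
```

With a plot of the data already open, lines(cumul_hat ~ time_diff_to_complete, grid) would overlay the smooth curve. Whether pnorm or plogis fits better depends on the data, so the comparison is worth rerunning on the full set of runs.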
