Abhishek Sourabh

Reputation: 101

Find the best curve to fit a family of curves using R

I have a process that generates a set of numbers (each < 1) on every run. The process runs until the cumulative sum of the generated numbers equals 1, so each set may contain a different count of numbers, but every set sums to 1.

There are thousands of runs of the process. Plotting the cumulative sum of the numbers for each run gives multiple curves, one per run.

For 50 runs: [image: output graph of 50 runs]

For 2000 runs: [image]

As you can see, the curves have a definite shape and the output is not random. I want to find the best-fit equation for this group of curves.

How can I do this in R? Most best-fit curve solutions I have found fit against a single set of data.

Here is the code to generate sample data with 5 runs.

run_group <- rep(c('A_group', 'B_group', 'C_group', 'D_group', 'E_group'), times = c(8, 14, 7, 9, 13))

cumul <- c(0.052631579, 0.263157895, 0.342105263, 0.710526316, 0.868421053, 0.894736842, 0.973684211, 1, 0.0078125, 0.015625, 0.0390625, 0.0546875, 0.0703125, 0.1015625, 0.1640625, 0.3203125, 0.4921875, 0.734375, 0.875, 0.96875, 0.9921875, 1, 0.073529412, 0.220588235, 0.323529412, 0.507352941, 0.727941176, 0.970588235, 1, 0.006134969, 0.055214724, 0.141104294, 0.190184049, 0.349693252, 0.595092025, 0.858895706, 0.969325153, 1, 0.005649718, 0.011299435, 0.016949153, 0.039548023, 0.073446328, 0.124293785, 0.299435028, 0.451977401, 0.559322034, 0.728813559, 0.81920904, 0.960451977, 1)

time_diff_to_complete <- c(-155, -140, -125, -110, -95, -80, -65, -50, -270, -210, -195, -180, -165, -150, -135, -120, -105, -90, -75, -60, -45, -30, -130, -115, -100, -85, -70, -55, -40, -175, -160, -130, -115, -100, -85, -70, -55, -40, -225, -210, -195, -180, -150, -135, -120, -105, -90, -75, -60, -45, -30)

sample_data <- data.frame(run_group, cumul, time_diff_to_complete, stringsAsFactors=FALSE)

Upvotes: 0

Views: 412

Answers (1)

G. Grothendieck

Reputation: 269586

Just stack them: pool the points from all runs into one data set and fit a single curve to it. The curves look like Gaussian CDFs, so we fit to pnorm. (The logistic CDF, plogis, would likely also work.)

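# Pool every run's time values; 'o' sorts the rows by time so the fitted line draws left to right
x <- sample_data$time_diff_to_complete
o <- order(x)
# Starting values: a = location (mean) and b = scale (sd) of the Gaussian cdf
st <- list(a = mean(x), b = sd(x))

# Nonlinear least squares fit of the stacked data to the Gaussian cdf
fm <- nls(cumul ~ pnorm(time_diff_to_complete, a, b), sample_data[o, ], start = st)

# Plot the raw points from all runs, then overlay the fitted curve
plot(cumul ~ time_diff_to_complete, sample_data)
lines(fitted(fm) ~ time_diff_to_complete, sample_data[o, ])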

The fit looks like this:

[image: screenshot of the fitted curve]
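To check whether the logistic CDF mentioned above fits better than the Gaussian one, the two nls fits can be compared by AIC. The following is only a rough sketch under stated assumptions: demo_data is a synthetic stand-in generated here (hypothetical values, not the asker's sample_data), and the names fm_norm, fm_logis and grid are introduced purely for illustration. The starting scale for plogis uses the fact that a logistic distribution with scale s has standard deviation s * pi / sqrt(3).

```r
# Synthetic stand-in for sample_data (hypothetical values, not the asker's data):
# five runs whose cumulative sums follow a noisy Gaussian cdf in time.
set.seed(42)
runs <- lapply(LETTERS[1:5], function(g) {
  tt <- seq(-200, -30, by = 15)
  noisy <- pnorm(tt, mean = -100, sd = 30) + rnorm(length(tt), sd = 0.03)
  data.frame(run_group = paste0(g, "_group"),
             time_diff_to_complete = tt,
             cumul = pmin(pmax(noisy, 0), 1))  # keep values inside [0, 1]
})
demo_data <- do.call(rbind, runs)

x <- demo_data$time_diff_to_complete

# Gaussian cdf fit, as in the answer: a = location, b = scale.
fm_norm <- nls(cumul ~ pnorm(time_diff_to_complete, a, b), demo_data,
               start = list(a = mean(x), b = sd(x)))

# Logistic cdf fit; sd(x) * sqrt(3) / pi is a comparable starting scale.
fm_logis <- nls(cumul ~ plogis(time_diff_to_complete, a, b), demo_data,
                start = list(a = mean(x), b = sd(x) * sqrt(3) / pi))

# Lower AIC suggests the better-fitting curve on the same data.
print(AIC(fm_norm, fm_logis))

# Evaluate the chosen curve on a regular grid to draw a smooth line.
grid <- data.frame(time_diff_to_complete = seq(min(x), max(x), length.out = 200))
grid$fit <- predict(fm_norm, newdata = grid)
plot(cumul ~ time_diff_to_complete, demo_data)
lines(fit ~ time_diff_to_complete, grid)
```

With the real data, replacing demo_data by sample_data should be enough. If nls fails to converge, adjusting the starting values is usually the first thing to try.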

Upvotes: 2
