Reputation: 11
I am working on a project to understand the distribution of different elevations in a landscape. This requires me to measure how many data points of each elevation there are in a raster (each pixel = one elevation value). Given that my mapped area is large, I end up with very large frequency counts for each elevation. I have a distribution of data that looks something like this:
Elevation, Frequency
0, 89021
0.01, 56893
0.02, 39504
0.03, 35894
...
Now, I've gone ahead and calculated my own CDF for these values based on the frequency counts. I am trying to fit a distribution to this CDF. I am aware of fitdist packages, but these seem to compute the CDF from raw measurements, and so would not work with my data unless I make a file with 89021 rows of 0, then 56893 rows of 0.01, etc. This is really impractical. Is there a way to fit a distribution to an already provided CDF? Thanks!
Additional Details:
My elevation values are:
elev = c(0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008,
0.009, 0.01, 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017,
0.018, 0.019, 0.02, 0.021, 0.022, 0.023, 0.024)
The number of measurements of each elevation value are:
count = c(415348, 162928, 73967, 81997, 54291, 57883, 36105,
49516, 33736, 37557, 39057, 41544, 27860, 20015, 20367,
19158, 19960, 15577, 1452, 9888, 3851, 2238, 157)
I have taken my frequency measurements and converted them to "% smaller than" values, i.e. cumulative proportions. The result is a vector of:
cumulative = c(0.3357563, 0.4674630, 0.5272559, 0.5935401,
0.6374275, 0.6842186, 0.7134049, 0.7534324, 0.7807036, 0.8110637,
0.8426364, 0.8762194, 0.8987407, 0.9149203, 0.9313845, 0.9468713,
0.9630064, 0.9755984, 0.9869577, 0.9949509, 0.9980639, 0.9998731,
1.0000000, 1.0000000)
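For reference, cumulative proportions like these are typically a normalized cumulative sum of the counts; a one-line R sketch (note the posted cumulative vector has one more entry than count, so it was presumably derived from a slightly longer count set):
cumulative = cumsum(count) / sum(count)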
I wish to fit a distribution function to these values if possible.
Upvotes: 1
Views: 585
Reputation: 17577
Data for which you only know the number of observations falling into each interval are censored data. In general, whatever the theoretical CDF, the likelihood function is a product of terms like
(probability of falling into the k'th interval)^(number of data in that interval)
where the probability is just
CDF(right end of k'th interval) - CDF(left end of k'th interval)
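In R, for a Weibull candidate, those interval probabilities are just differences of pweibull at the endpoints; a minimal sketch with made-up endpoints and parameter values:
breaks = c(0, 0.01, 0.02, 0.03)                      # hypothetical interval endpoints
p = diff(pweibull(breaks, shape = 0.5, scale = 2))   # probability of each interval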
You can get the Weibull CDF for each endpoint from a function or just write it out. Given that the intervals are contiguous, the probability is
CDF(x[k + 1]) - CDF(x[k])
where x[k] is the list of endpoints. (You'll need to be careful about the last interval.) So the likelihood function is
product ((CDF(x[k + 1]) - CDF(x[k]))^n[k], k, 0, nintervals - 1)
and the log likelihood is
sum (n[k]*log(CDF(x[k + 1]) - CDF(x[k])), k, 0, nintervals - 1)
It's probably not a bad idea to scale the log likelihood so that the factors are not so big. Scaling doesn't change the location of the maximum likelihood parameters so that's OK. I'll divide by (total n).
sum ((n[k]/(total n))*log(CDF(x[k + 1]) - CDF(x[k])), k, 0, nintervals - 1)
Now n[k]/(total n) = p[k] where p[k] is the proportion falling into the k'th interval. That gives the log likelihood a pleasing entropy-like form,
sum (p[k]*log(CDF(x[k + 1]) - CDF(x[k])), k, 0, nintervals - 1)
At this point you can plug in the Weibull CDF formula and apply a numerical minimizer to it. From looking at your data, it seems like scale > 1 and shape < 1, so maybe pick initial values scale = 2 and shape = 1/2 or something. (I'm looking at https://en.wikipedia.org/wiki/Weibull_distribution for that.)
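For concreteness, here is a minimal R sketch of that objective with the Weibull CDF plugged in (negated, since most numerical routines, including R's optim, minimize; the function name is my own):
negll = function(par, breaks, n) {
  # par = c(shape, scale); breaks = interval endpoints; n = counts per interval
  p = diff(pweibull(breaks, shape = par[1], scale = par[2]))
  p = pmax(p, .Machine$double.xmin)   # guard against log(0) if an interval probability underflows
  -sum((n / sum(n)) * log(p))         # negative of the scaled log likelihood
}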
EDIT: I gave it a try with an implementation of LBFGS in Maxima. No doubt there's something like that in Python or R. NOTE: the summation I showed above is the log likelihood, which you want to maximize. However, many numerical routines want to minimize, so supply the negative of the log likelihood.
I replaced elev with 100 * elev, so the range of values is 0 to 2.4 instead of 0 to 0.024. I found that helps LBFGS find a minimum.
With initial guess of shape = 0.3, scale = 1, I found the approximate solution shape = 1.185, scale = 0.609. (Again, note the scale is 100 times the original scale.) Plotting the Weibull CDF seems to show a picture which agrees with the empirical CDF plot.
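Here is a rough R equivalent of that run, using optim's L-BFGS-B method with the negll sketch from above. How the counts pair up with intervals is my assumption (the posted count vector has one fewer entry than elev), so the fitted values may come out slightly different from the Maxima result:
elev100 = 100 * elev                       # rescaled elevations, as described above
breaks = c(0, elev100[seq_along(count)])   # assume count[k] falls in (breaks[k], breaks[k+1]]
fit = optim(c(shape = 0.3, scale = 1), negll, breaks = breaks, n = count,
            method = "L-BFGS-B", lower = c(1e-6, 1e-6))
fit$par                                    # the Maxima run reported roughly shape = 1.185, scale = 0.609
plot(elev100, cumulative, xlab = "elevation * 100", ylab = "P(X <= x)")
curve(pweibull(x, shape = fit$par["shape"], scale = fit$par["scale"]),
      add = TRUE, col = "red")             # overlay the fitted CDF on the empirical one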
Finally, the approximation I mentioned can probably be implemented by replacing x[k] with p[k]*x[k] in the formulas for maximum likelihood estimation shown on the wiki page. However, I wasn't able to prove that after looking at the formulas for a bit.
Upvotes: 1