Reputation: 133
I have a continuous variable that ranges from 0 to 1 (proportion data, including exact 0s), and I want to determine the best distribution to model it. I'm working in RStudio; the data in question are here. Note that about 27% of observations are 0, and I do plan on exploring zero inflation as I go.
I checked the histogram and ECDF (see below) to get an idea of what I'm dealing with. fitdistrplus pointed to a beta distribution, while gamlss pointed to a Pareto Type 2, which I'm not very familiar with.
I've estimated the parameters of a beta distribution and fit it, and used the KS test to check a few other distributions, but I am stuck on the Pareto Type 2. The problem: all my attempts at estimating location and scale fail. As far as I can tell, that's because of the zeroes in the dataset. It works if I add a tiny amount to the entire dataset (e.g. 0.0001), but honestly I'm not sure that is a good solution, and it would make comparing the fit to anything else a living hell. I tried EnvStats, VGAM, and CaDENCE, and all of them give me errors. So I humbly come here in the hope that someone can suggest another option for estimating the Pareto Type 2 parameters for this dataset.
Upvotes: 0
Views: 254
Reputation: 2223
You can estimate the parameters by maximising the Pareto Type 2 likelihood directly, for example with DEoptim:
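For reference, here is the likelihood that the code works with, written out under the assumption that the location is fixed at 0 (the Lomax form), with shape $k$ and scale $s$. The symbols match the variable names in the code.

$$f(x \mid k, s) = \frac{k}{s + x}\left(\frac{s}{s + x}\right)^{k}, \qquad x \ge 0,\ k > 0,\ s > 0$$

$$\ell(k, s) = \sum_{i=1}^{n} \left[\log k + k \log s - (k + 1)\log(s + x_i)\right]$$

At $x = 0$ the density equals $k/s$, which is finite, so exact zeros in the data do not break this likelihood as long as $s > 0$.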
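library(DEoptim)

# Read the data and flatten it to a plain numeric vector
df <- read.csv("percentData.csv")
data <- unlist(df)

# Negative log-likelihood of the Pareto Type 2 (Lomax) distribution with
# location 0, shape k and scale s. DEoptim minimises, hence the sign flip.
neg_log_lik <- function(param, data)
{
  k <- param[1]
  s <- param[2]
  -sum(log(k) + k * log(s) - (k + 1) * log(s + data))
}

# Lower bounds sit just above 0 so that log() stays finite, even when x = 0
obj_Res <- DEoptim(fn = neg_log_lik, lower = c(1e-8, 1e-8), upper = c(1000, 1000), data = data, control = list(parallelType = 1))
obj_Res$optim$bestmem  # estimated c(k, s)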
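As a rough sanity check, the same likelihood can be tried on simulated data before trusting it on the real file. The sketch below is only an illustration: the "true" values `k_true` and `s_true`, the sample size, and the use of base R's `optim` in place of DEoptim are all assumptions made here, not part of the approach above.

```r
set.seed(42)

# Hypothetical "true" parameters, chosen only for this illustration
k_true <- 3
s_true <- 2

# Inverse-CDF sampling from F(x) = 1 - (s / (s + x))^k
u <- runif(5000)
x <- s_true * (u^(-1 / k_true) - 1)
x <- c(0, x)  # add an exact zero to show the likelihood copes with it

# Negative log-likelihood, parameterised on the log scale so that
# k and s stay positive without box constraints
neg_loglik <- function(log_param, x) {
  k <- exp(log_param[1])
  s <- exp(log_param[2])
  -sum(log(k) + k * log(s) - (k + 1) * log(s + x))
}

# Nelder-Mead (the optim default), starting from k = 1, s = 1
fit <- optim(par = c(0, 0), fn = neg_loglik, x = x,
             control = list(maxit = 2000))
est <- exp(fit$par)
names(est) <- c("k", "s")
est
```

If the estimates land reasonably close to the simulated values, the likelihood is coded correctly. Shape and scale estimates for this distribution tend to be strongly correlated, so some deviation is expected even with a few thousand observations.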
Upvotes: 1