Gina Alexandra
Gina Alexandra

Reputation: 13

Fitting distributions in R and Matlab give very different results

I am fitting a wave height dataset (m) to a generalized pareto distribution. I already selected a threshold at 0.15m. I'm transitioning from Matlab to R so I wanted to compare the results from both programs. I knew that both programs wouldn't provide the same exact answer but I was expecting to get results somewhat close to each other.

Of course, for both fittings, I used the same data and the same threshold. Both methods estimate the parameters by MLE. In Matlab, the fitdist function does not estimate the threshold. It is assumed to be known (which in this case is 0.15m), and subtracted from data before calling it (which I already did). While in R the threshold is part of the input.

The functions I used are: Matlab: fitdist(data_above_threshold,'Generalized Pareto') R: fevd(data,threshold =0.15,type = 'GP',method = "MLE"). (extRemes Package)

In Matlab I get: Shape:-1.0301 Scale:0.7534 while in R I get Shape:-0.5848505 Scale:0.3620191

Here I attach the data so you can see the results for yourselves:

c(0.194101337780203, 0.289791483274648, 0.313940773547535, 0.184577674010614, 
0.102266008573448, 0.045464156804826, 0.0486387113946889, 0.143761972140945, 
0.267342847246331, 0.242966803074167, 0.069386693178437, 0.0099771715681416, 
0.00736950172646855, 0.056121590070795, 0.0759625562574393, 0.145009118586962, 
0.258045937376017, 0.379926158236833, 0.236844447793717, 0.0861664817248564, 
0.0503393656392586, 0.156233436601121, 0.118138781522764, 4.44089209850063e-16, 
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16, 
0.0442170103588082, 4.44089209850063e-16, 4.44089209850063e-16, 
4.44089209850063e-16, 4.44089209850063e-16, 0.0143988726040223, 
0.148183673176825, 0.197729400168618, 0.0143988726040223, 4.44089209850063e-16, 
4.44089209850063e-16, 4.44089209850063e-16, 0.026416829265647, 
0.133558046673528, 0.209180472082052, 0.236050809146251, 0.0976175536382913, 
0.0292512530065965, 0.101812500774896, 0.174260371593558, 0.219271020599832, 
0.463144839271102, 0.579809720448572, 0.533211794147367, 0.354756475417204, 
0.360085192050189, 0.632756755929504, 0.651577329569406, 0.413372358380034, 
0.39353139219339, 0.343985665201597, 0.0630375839987112, 4.44089209850063e-16, 
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16, 
4.44089209850063e-16, 0.0486387113946889, 0.0434233717113424, 
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16, 
0.0566884748189849, 0.147730165378273, 0.175734271938852, 0.0827651732357175, 
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16, 
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16, 
0.0385481628769098, 0.0378679011790819, 0.0463711724019298, 0.0200677200859207, 
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16, 
4.44089209850063e-16, 4.44089209850063e-16, 0.0956901454944461, 
0.128909591738371, 0.141154302299272, 0.0459176646033779, 4.44089209850063e-16, 
0.0285709913087686, 0.251696828196291, 0.642847304447283, 0.731394702114536, 
0.603732256822183, 0.427997984883332, 0.635251048821539)

So, my questions are: If both use MLE to estimate the parameters, then why are the results so different? Then which fit is more reliable or which one do you recommend to use?

Upvotes: 0

Views: 403

Answers (1)

Roland
Roland

Reputation: 132706

I believe the issue is that your data hints at some multimodality and Matlab is using suboptimal starting values.

An empirical density plot of your data:

plot(density(x, bw = "SJ"))

density plot

First let's plot the R fit:

library(SpatialExtremes)
library(extRemes)
plot(fevd(x,threshold =0.15,type = 'GP',method = "MLE"))

resulting plot

The lower-left plot shows that we have achieved a decent fit. (Note that a different bandwidth than above is used for the kernel density estimate in this plot.)

Now we use your Matlab result as starting values:

fevd(x,threshold =0.15,type = 'GP',method = "MLE", 
     initial = list(shape = -1.0301, scale = 0.7534))
#fevd(x = x, threshold = 0.15, type = "GP", method = "MLE", 
#    initial = list(shape = -1.0301, scale = 0.7534))
#
#[1] "Estimation Method used: MLE"
#
#
# Negative Log-Likelihood Value:  -18.74149 
#
#
# Estimated parameters:
#     scale      shape 
# 0.6242056 -1.0736348 
#
# Standard Error Estimates:
#      scale       shape 
#0.000000020 0.001368606 
#
# Estimated parameter covariance matrix.
#              scale         shape
#scale  4.000000e-16 -2.598691e-17
#shape -2.598691e-17  1.873082e-06
#
# AIC = -33.48297 
#
# BIC = -30.5515 
#Warning messages:
#1: In log(z) : NaNs produced
#2: In log(z) : NaNs produced
plot(fevd(x,threshold =0.15,type = 'GP',method = "MLE", 
          initial = list(shape = -1.0301, scale = 0.7534)))
#Warning messages:
#1: In log(z) : NaNs produced
#2: In log(z) : NaNs produced

second resulting plot

We can clearly see that the fit is worse and we have a false convergence or local minimum issue. The warnings further reduce our confidence in this fit. I suggest that you try Matlab with different starting values.

Upvotes: 1

Related Questions