Reputation: 13
I am fitting a wave height dataset (m) to a generalized pareto distribution. I already selected a threshold at 0.15m. I'm transitioning from Matlab to R so I wanted to compare the results from both programs. I knew that both programs wouldn't provide the same exact answer but I was expecting to get results somewhat close to each other.
Of course, for both fittings, I used the same data and the same threshold. Both methods estimate the parameters by MLE. In Matlab, the fitdist function does not estimate the threshold. It is assumed to be known (which in this case is 0.15m), and subtracted from data before calling it (which I already did). While in R the threshold is part of the input.
The functions I used are:
Matlab:
fitdist(data_above_threshold,'Generalized Pareto')
R:
fevd(data,threshold =0.15,type = 'GP',method = "MLE")
. (extRemes Package)
In Matlab I get:
Shape:-1.0301 Scale:0.7534
while in R I get Shape:-0.5848505 Scale:0.3620191
Here I attach the data so you can see the results for yourselves:
c(0.194101337780203, 0.289791483274648, 0.313940773547535, 0.184577674010614,
0.102266008573448, 0.045464156804826, 0.0486387113946889, 0.143761972140945,
0.267342847246331, 0.242966803074167, 0.069386693178437, 0.0099771715681416,
0.00736950172646855, 0.056121590070795, 0.0759625562574393, 0.145009118586962,
0.258045937376017, 0.379926158236833, 0.236844447793717, 0.0861664817248564,
0.0503393656392586, 0.156233436601121, 0.118138781522764, 4.44089209850063e-16,
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16,
0.0442170103588082, 4.44089209850063e-16, 4.44089209850063e-16,
4.44089209850063e-16, 4.44089209850063e-16, 0.0143988726040223,
0.148183673176825, 0.197729400168618, 0.0143988726040223, 4.44089209850063e-16,
4.44089209850063e-16, 4.44089209850063e-16, 0.026416829265647,
0.133558046673528, 0.209180472082052, 0.236050809146251, 0.0976175536382913,
0.0292512530065965, 0.101812500774896, 0.174260371593558, 0.219271020599832,
0.463144839271102, 0.579809720448572, 0.533211794147367, 0.354756475417204,
0.360085192050189, 0.632756755929504, 0.651577329569406, 0.413372358380034,
0.39353139219339, 0.343985665201597, 0.0630375839987112, 4.44089209850063e-16,
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16,
4.44089209850063e-16, 0.0486387113946889, 0.0434233717113424,
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16,
0.0566884748189849, 0.147730165378273, 0.175734271938852, 0.0827651732357175,
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16,
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16,
0.0385481628769098, 0.0378679011790819, 0.0463711724019298, 0.0200677200859207,
4.44089209850063e-16, 4.44089209850063e-16, 4.44089209850063e-16,
4.44089209850063e-16, 4.44089209850063e-16, 0.0956901454944461,
0.128909591738371, 0.141154302299272, 0.0459176646033779, 4.44089209850063e-16,
0.0285709913087686, 0.251696828196291, 0.642847304447283, 0.731394702114536,
0.603732256822183, 0.427997984883332, 0.635251048821539)
So, my questions are: If both use MLE to estimate the parameters, then why are the results so different? Then which fit is more reliable or which one do you recommend to use?
Upvotes: 0
Views: 403
Reputation: 132706
I believe the issue is that your data hints at some multimodality and Matlab is using suboptimal starting values.
An empirical density plot of your data:
plot(density(x, bw = "SJ"))
First let's plot the R fit:
library(SpatialExtremes)
library(extRemes)
plot(fevd(x,threshold =0.15,type = 'GP',method = "MLE"))
The lower-left plot shows that we have achieved a decent fit. (Note that a different bandwidth than above is used for the kernel density estimate in this plot.)
Now we use your Matlab result as starting values:
fevd(x,threshold =0.15,type = 'GP',method = "MLE",
initial = list(shape = -1.0301, scale = 0.7534))
#fevd(x = x, threshold = 0.15, type = "GP", method = "MLE",
# initial = list(shape = -1.0301, scale = 0.7534))
#
#[1] "Estimation Method used: MLE"
#
#
# Negative Log-Likelihood Value: -18.74149
#
#
# Estimated parameters:
# scale shape
# 0.6242056 -1.0736348
#
# Standard Error Estimates:
# scale shape
#0.000000020 0.001368606
#
# Estimated parameter covariance matrix.
# scale shape
#scale 4.000000e-16 -2.598691e-17
#shape -2.598691e-17 1.873082e-06
#
# AIC = -33.48297
#
# BIC = -30.5515
#Warning messages:
#1: In log(z) : NaNs produced
#2: In log(z) : NaNs produced
plot(fevd(x,threshold =0.15,type = 'GP',method = "MLE",
initial = list(shape = -1.0301, scale = 0.7534)))
#Warning messages:
#1: In log(z) : NaNs produced
#2: In log(z) : NaNs produced
We can clearly see that the fit is worse and we have a false convergence or local minimum issue. The warnings further reduce our confidence in this fit. I suggest that you try Matlab with different starting values.
Upvotes: 1