Reputation: 466
I am trying to rewrite something similar to the following SAS optimization code in Python. The goal of the code is to find parameters for a continuous (in this case normal) distribution that best fits some empirical data. I am familiar with Python but very new to SAS.
proc nlin data=mydata outest=est METHOD=MARQUARDT SMETHOD=GOLDEN SAVE;
    parms m = -1.0 to 1.0 by 0.1
          s = 1.0 to 2.0 by 0.01;
    bounds -10<m<10;
    bounds 0<s<10;
    model NCDF = CDF('NORMAL',XValue,m,s);
run;
In Python, I have something set up like this:
import pandas as pd
from scipy import stats
from scipy.optimize import minimize
def distance_empirical_fitted_cdf( params: list, data: pd.Series ):
    m, s = params
    # data holds percentage weights, so the running sum is the empirical CDF
    empirical_cdf = ( data / 100 ).cumsum()
    cost = 0
    for point in range( 10 ):
        empirical_cdf_at_point = empirical_cdf.iloc[ point ]
        fitted_cdf_at_point = stats.norm.cdf( x = point, loc = m, scale = s )
        cost += ( fitted_cdf_at_point - empirical_cdf_at_point ) ** 2
    return cost

result = minimize( distance_empirical_fitted_cdf, x0=[ 0, 1.5 ], args=( distribution, ),
                   bounds=[ ( -10, 10 ), ( 0, 10 ) ] )
fitted_m, fitted_s = result.x
The code I have now gets me fairly close to the SAS code's output in most cases, but not all. Ideally, I could get them to match (or be as close as possible) and understand why they don't.
As far as I can tell, there are two sources of discrepancy. First, the SAS code takes a grid of possible starting values (in this case -1.0 to 1.0 by 0.1 for m and 1.0 to 2.0 by 0.01 for s) to initialize the parameters. Is there an equivalent of this in Python?
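The closest equivalent I've come up with myself is to evaluate the objective over the same grid and start the optimization from the best grid point, though I don't know whether this is actually what PROC NLIN does internally:
import numpy as np
from itertools import product

# Assumed to mirror the SAS parms grid: score every combination of starting
# values, then optimize once from the best one.
m_grid = np.arange( -1.0, 1.01, 0.1 )   # m = -1.0 to 1.0 by 0.1
s_grid = np.arange( 1.0, 2.001, 0.01 )  # s = 1.0 to 2.0 by 0.01
x0 = min( product( m_grid, s_grid ),
          key=lambda p: distance_empirical_fitted_cdf( p, distribution ) )
result = minimize( distance_empirical_fitted_cdf, x0=list( x0 ),
                   args=( distribution, ), bounds=[ ( -10, 10 ), ( 0, 10 ) ] )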
Second, the SAS code specifically uses the Marquardt optimization method and the Golden step size search method. The only Python code I could find referencing the Marquardt method is scipy.optimize.least_squares with method="lm", but this doesn't support bounds (and its result is much further off from the SAS output than scipy.optimize.minimize's when I try it without bounds).
The only Python code I could find referencing the golden step size search method is scipy.optimize.golden, but the documentation says this is for minimizing functions of one variable, and it doesn't seem to support bounds either.
Any insight on getting the Python output closer to the SAS output would be greatly appreciated, thanks!
Upvotes: 1
Views: 313
Reputation: 1121
Not an answer, as it's still inconclusive why (or how) the two sets of code produce different results, so this will likely change as more info is introduced. In the meantime, here are some observations that might be useful but are too much to fit into the comments section.
Algorithm
The Marquardt method as called by SAS is more commonly referred to (at least by me) as the Levenberg-Marquardt algorithm, abbreviated LMA or LM.
Bounds
The math of LMA is not defined to handle bounds, which is why scipy's Levenberg-Marquardt routines provide no bounds option.
The MINPACK-1 implementation used in scipy.optimize.leastsq for the Levenberg-Marquardt algorithm does not explicitly support bounds on parameters, and expects to be able to fully explore the available range of values for any Parameter. Simply placing hard constraints (that is, resetting the value when it exceeds the desired bounds) prevents the algorithm from determining the partial derivatives, and leads to unstable results.
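That said, scipy.optimize.least_squares does accept box bounds with its trust-region methods ("trf", the default, and "dogbox"), just not with method="lm". A minimal sketch, assuming the question's objective is recast as per-point residuals and distribution holds the data:
from scipy import stats
from scipy.optimize import least_squares

def residuals( params, data ):
    # One residual per evaluation point; least_squares minimizes
    # the sum of their squares internally.
    m, s = params
    empirical_cdf = ( data / 100 ).cumsum()
    return [ stats.norm.cdf( x, loc=m, scale=s ) - empirical_cdf.iloc[ x ]
             for x in range( 10 ) ]

# method="lm" is the MINPACK driver quoted above and rejects bounds;
# method="trf" handles the same box constraints as the question's minimize() call.
fit = least_squares( residuals, x0=[ 0, 1.5 ], args=( distribution, ),
                     bounds=( [ -10, 0 ], [ 10, 10 ] ), method="trf" )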
The discrepancy in results may arise from this lack of bounds support, depending on how SAS decides to handle the bounds it is given. It may be that SAS ignores the provided bound values, re-runs until a solution within the bounds is found, or uses some other method entirely.
Another possible cause is that a better solution exists outside of your provided bounds. Then either SAS limits its solutions to within the bounds but Python doesn't, or vice versa. This would result in one code returning a local minimum within the bounds while the other returns a better (possibly global) minimum outside them.
However, I can't find where the SAS docs for NLIN explain how this is handled, so this is still inconclusive.
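One quick way to test the out-of-bounds hypothesis with the question's setup is to run the same minimize() call with and without the bounds and compare the two optima (this reuses distance_empirical_fitted_cdf and distribution from the question):
from scipy.optimize import minimize

bounded = minimize( distance_empirical_fitted_cdf, x0=[ 0, 1.5 ],
                    args=( distribution, ), bounds=[ ( -10, 10 ), ( 0, 10 ) ] )
free = minimize( distance_empirical_fitted_cdf, x0=[ 0, 1.5 ],
                 args=( distribution, ) )
# If the unbounded run finds a noticeably lower cost at parameters outside
# the box, the two tools may simply be reporting different minima.
print( bounded.fun, bounded.x )
print( free.fun, free.x )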
Step-Size Search
Note that the NLIN procedure's SMETHOD option is used to search for an optimal step size. It's unclear from the SAS docs on SMETHOD what the "step size" is exactly, but I believe in this context it could refer to the damping parameter lambda.
As this parameter is quite important to the performance of LMA, different ways of determining it could also affect convergence (and thus the final results), as sketched below.
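For concreteness, here is an illustrative numpy sketch of a single Levenberg-Marquardt update (not SAS's actual implementation), showing where the damping parameter enters and, in case "step size" instead means the step length, how a golden-section line search could pick it:
import numpy as np
from scipy.optimize import minimize_scalar

def lm_step( beta, x, y, f, jac, lam ):
    # One illustrative LM update. lam is the damping parameter: large lam
    # gives cautious gradient-descent-like steps, small lam gives aggressive
    # Gauss-Newton-like steps.
    r = y - f( x, beta )          # residuals at the current parameters
    J = jac( x, beta )            # Jacobian of f w.r.t. beta, shape (n, p)
    JtJ = J.T @ J
    delta = np.linalg.solve( JtJ + lam * np.diag( np.diag( JtJ ) ), J.T @ r )
    # A golden-section line search (cf. SMETHOD=GOLDEN) over the step length:
    sse = lambda alpha: np.sum( ( y - f( x, beta + alpha * delta ) ) ** 2 )
    alpha = minimize_scalar( sse, bracket=( 0, 1 ), method="golden" ).x
    return beta + alpha * delta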
Again, all of this depends on how the two results differ.
If the two results are significantly different (completely different output CDFs), then chances are only one of them matches the actual data. Whichever code produces the CDF that doesn't match the data is probably doing something wrong and needs to be scrutinized.
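A quick way to run that comparison, assuming sas_m/sas_s hold PROC NLIN's estimates and fitted_m/fitted_s come from the question's minimize() call (all four names here are placeholders):
import numpy as np
from scipy import stats

points = np.arange( 10 )
empirical = ( distribution / 100 ).cumsum().iloc[ :10 ].to_numpy()

# sas_m, sas_s and fitted_m, fitted_s are placeholders for the two fits.
for label, ( m, s ) in [ ( "SAS", ( sas_m, sas_s ) ),
                         ( "scipy", ( fitted_m, fitted_s ) ) ]:
    sse = np.sum( ( stats.norm.cdf( points, loc=m, scale=s ) - empirical ) ** 2 )
    print( label, "m =", m, "s =", s, "SSE =", sse )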
Upvotes: 1