Reputation: 38740
I have a grouped/nested data frame on which I want to run an nls model fit for each group. This code used to work fine on one machine; running it on another machine results in an error.
A minimal reproducible example:
third_order_polynomial <- function(video_bitrate, a, b, c, d = 0) {
  return(a * log10(video_bitrate)^3 + b * log10(video_bitrate)^2 +
           c * log10(video_bitrate) + d)
}
problematic_data = read.csv(text="video_name,video_target_bitrate,video_height,video_width,video_frame_rate,video_bitrate,video_duration,video_size,video_bitrate_log,score
water_netflix_200kbps_360p_59.94fps_vp9.mkv,200,360,640,59.94,201.18,9.994,0.2454337060546875,2.3035848038124636,1.31034482758621
water_netflix_750kbps_360p_59.94fps_vp9.mkv,750,360,640,59.94,735.98,9.994,0.8978740380859375,2.866866012696663,1.9655172413793105
water_netflix_2000kbps_720p_59.94fps_vp9.mkv,2000,720,1280,59.94,1737.01,9.994,2.119101311035156,3.239802318695986,2.68965517241379
water_netflix_7500kbps_1080p_59.94fps_vp9.mkv,7500,1080,1920,59.94,7374.88,9.994,8.9971375390625,3.8677549581062247,3.6551724137931
water_netflix_15000kbps_1080p_59.94fps_vp9.mkv,15000,1080,1920,59.94,14738.57,9.994,17.98062360595703,4.168455348432381,4.20689655172414
water_netflix_40000kbps_2160p_59.94fps_vp9.mkv,40000,2160,3840,59.94,37534.27,9.994,45.790709763183585,4.574427973737644,4.48275862068965
")
nls(
  formula = score ~ third_order_polynomial(video_bitrate, a, b, c, d),
  data = problematic_data,
  start = list(a = 1, b = 1, c = 1, d = 0),
  lower = list(a = -1, b = -1, c = -5, d = 0),
  upper = list(a = 5, b = 5, c = 5, d = 5),
  algorithm = "port"
)
#> Error in nls(formula = score ~ third_order_polynomial(video_bitrate, a, : Convergence failure: singular convergence (7)
Created on 2022-05-06 by the reprex package (v2.0.1)
The error appears under Ubuntu 20.04 using R 4.2.0 x86_64-pc-linux-gnu.
The error does not occur under macOS 12.3.1 using R 4.2.0 aarch64-apple-darwin20 (64-bit).
Is this a bug in the stats package, or some numerical instability? I guess I will file a bug report, but I'm wondering how I could prevent this.
Note that the start parameters have been chosen to give a good fit for the entire data set. The whole code is here (a link to the particular revision, which can also be reproduced by simply sourcing the script).
Upvotes: 1
Views: 129
Reputation: 269860
1) This is a linear model, i.e. linear in the coefficients, so we can use lm.
fm3 <- lm(formula = score ~ poly(log10(video_bitrate), 3), problematic_data)
2) Or use raw = TRUE so that the coefficients correspond to those in the question; fm3 and fm3r give the same predictions and are only parameterized differently. That is, all.equal(fitted(fm3), fitted(fm3r)) is TRUE.
fm3r <- lm(formula = score ~ poly(log10(video_bitrate), 3, raw = TRUE), problematic_data)
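A quick check that the raw coefficients really are the a, b, c, d of the question (a sketch, assuming third_order_polynomial() from the question; cf and f_hat are illustrative names):
# raw coefficients come back in the order (Intercept), degree 1, 2, 3,
# i.e. d, c, b, a in the question's notation
cf <- coef(fm3r)
f_hat <- third_order_polynomial(problematic_data$video_bitrate,
                                a = cf[4], b = cf[3], c = cf[2], d = cf[1])
all.equal(unname(f_hat), unname(fitted(fm3r)))  # TRUE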
3) Also note that the model is not significantly different from the simpler score ~ poly(log10(video_bitrate), 1) model, which is equivalent to score ~ log10(video_bitrate) modulo parametrization (checked below, after the anova output).
fm1 <- lm(formula = score ~ poly(log10(video_bitrate), 1), data = problematic_data)
anova(fm1, fm3)
giving:
Analysis of Variance Table

Model 1: score ~ poly(log10(video_bitrate), 1)
Model 2: score ~ poly(log10(video_bitrate), 3)
  Res.Df      RSS Df Sum of Sq      F Pr(>F)
1      4 0.079747
2      2 0.012857  2  0.066891 5.2028 0.1612
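The "modulo parametrization" remark can be verified directly (fm1_log is an illustrative name):
fm1_log <- lm(score ~ log10(video_bitrate), data = problematic_data)
all.equal(fitted(fm1), fitted(fm1_log))  # TRUE: same fitted values, different coefficients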
Graphics:
plot(score ~ log10(video_bitrate), problematic_data)
lines(fitted(fm3) ~ log10(video_bitrate), problematic_data)
lines(fitted(fm1) ~ log10(video_bitrate), problematic_data, lty = 2)
legend("topleft", c("fm3", "fm1"), lty = 1:2)
Revised to include log10, which had been omitted. Added graphics. Removed some output for brevity.
Upvotes: 1
Reputation: 226577
Your data are well-behaved, but nevertheless you're fitting 4 parameters to only 6 data points, so this kind of problem is not surprising. It's extremely common to get differences in convergence between different platforms/versions of R/etc.: tiny changes (compiler versions, compiler optimization flags, etc.) can easily flip a fitting attempt from just-barely-working to just-barely-failing.
Your starting values may be generally appropriate, but they're pretty far off for this data set (see plot below).
Do you need to use nls, or even non-linear fitting at all?
lm(score ~ poly(log10(video_bitrate), 3), data = problematic_data)
(This uses an orthogonal polynomial basis, which will be most numerically stable; you can probably get away with poly(log10(video_bitrate), 3, raw = TRUE), which will give you interpretable parameters.) This doesn't easily allow you to incorporate constraints on the parameters, but for this particular data set the constraints don't seem to be binding.
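A quick way to check that the constraints would not bind here (a sketch; fit_raw and cf are illustrative names, and the coefficients are reordered from (Intercept), x, x^2, x^3 into the question's a, b, c, d):
fit_raw <- lm(score ~ poly(log10(video_bitrate), 3, raw = TRUE),
              data = problematic_data)
cf <- rev(unname(coef(fit_raw)))  # a, b, c, d
all(cf >= c(-1, -1, -5, 0) & cf <= c(5, 5, 5, 5))  # TRUE if no bound binds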
nls is notoriously opaque. The PORT documentation (see here) says:

A return with IV(1) = 7 occurs if a more favorable stopping test is not satisfied and if the algorithm thinks

f(x) − min{f(y): ||D(y − x)|| ≤ V(LMAXS)} < V(SCTOL) * |f(x)|,

where D is given by (4.1). When this test is satisfied, it appears that x has too many degrees of freedom, and you should ponder whether f was properly formulated. Default = max{10^(−10), MACHEP^(2/3)}.
I don't think it's possible to adjust this tolerance from the R level. If you don't use algorithm = "port" you won't see this error (but you might be discarding a meaningful warning).
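For example (a sketch only: dropping the bounds means dropping the port algorithm, and whether the default Gauss-Newton fit converges may itself vary by platform, as the question demonstrates):
nls(
  formula = score ~ third_order_polynomial(video_bitrate, a, b, c, d),
  data = problematic_data,
  start = list(a = 1, b = 1, c = 1, d = 0)
)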
Some alternatives:
- nls.lm() from the minpack.lm package
- optim() or nlminb() (you have to write an objective function that computes the residual sum of squares yourself; see the sketch after the plot below)

library(ggplot2)
ggplot(problematic_data, aes(log10(video_bitrate), score)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3)) +
  ## starting values
  geom_function(fun = function(x) x^3 + x^2 + x, colour = "red") +
  scale_y_log10()
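A minimal sketch of the nlminb() route mentioned above, assuming problematic_data and third_order_polynomial() from the question; the objective simply returns the residual sum of squares, and the question's bounds go straight into lower/upper:
rss <- function(par) {
  # residual sum of squares for the question's cubic-in-log10 model
  pred <- third_order_polynomial(problematic_data$video_bitrate,
                                 par[1], par[2], par[3], par[4])
  sum((problematic_data$score - pred)^2)
}
nlminb(start = c(a = 1, b = 1, c = 1, d = 0), objective = rss,
       lower = c(-1, -1, -5, 0), upper = c(5, 5, 5, 5))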
Upvotes: 2