ggplot2, fitting data with log2 or log10 doesn't affect the plot

Question

I wanted to display a geom_smooth with a natural log, and this code works fine:

library(ggplot2)  # mean_cl_boot additionally requires the Hmisc package to be installed

df <- iris
iris_logplot <- ggplot(df, aes(Sepal.Length, Sepal.Width, colour = Species))

iris_logplot +
  stat_summary(fun.y = median, geom = "point") +
  stat_summary(fun.data = mean_cl_boot, aes(group = Species), geom = "errorbar", width = 0.2) +
  geom_smooth(method = "lm", formula = y ~ log(x))

Now I want to display a geom_smooth with a log of base 2, and I use this code:

df <- iris
iris_logplot <- ggplot(df, aes(Sepal.Length, Sepal.Width, colour = Species))

iris_logplot +
  stat_summary(fun.y = median, geom = "point") +
  stat_summary(fun.data = mean_cl_boot, aes(group = Species), geom = "errorbar", width = 0.2) +
  geom_smooth(method = "lm", formula = y ~ log2(x))

Why are the plots the same?

Thanks

Gregor Thomas · Accepted Answer

The lines are the same because multiplying a feature in a linear model by a constant does not change the fit; the coefficient for that feature is simply divided by the same constant. The "change of base" formula tells us that log_b(x) = log_a(x) / log_a(b), so log2(x) is just log(x) multiplied by the constant 1 / log(2).
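As a quick numeric check of the change-of-base identity, here is a minimal sketch in base R. The vector `x` holds arbitrary illustrative values and is not taken from the question's data.

```r
# Arbitrary illustrative values (an assumption for this sketch, not from the question)
x <- c(4.3, 5.8, 7.9)

# Change of base: log2(x) is log(x) scaled by the constant 1 / log(2)
same_values <- all.equal(log2(x), log(x) / log(2))
print(same_values)
```

Because the two predictors differ only by a constant factor, a linear model can absorb that factor into the slope.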

We can verify this by examining the models:

m_log_e = lm(Sepal.Width ~ log(Sepal.Length) * Species, data = iris)
m_log_2 = lm(Sepal.Width ~ log2(Sepal.Length) * Species, data = iris)

summary(m_log_e)
# Call:
# lm(formula = Sepal.Width ~ log(Sepal.Length) * Species, data = iris)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.71398 -0.15310 -0.00419  0.16595  0.60237 
# 
# Coefficients:
#                                     Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                          -2.9663     0.8872  -3.343 0.001055 ** 
# log(Sepal.Length)                     3.9760     0.5512   7.214 2.86e-11 ***
# Speciesversicolor                     2.3355     1.1899   1.963 0.051595 .  
# Speciesvirginica                      3.0464     1.1639   2.617 0.009807 ** 
# log(Sepal.Length):Speciesversicolor  -2.0626     0.7087  -2.910 0.004186 ** 
# log(Sepal.Length):Speciesvirginica   -2.4373     0.6811  -3.579 0.000471 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.272 on 144 degrees of freedom
# Multiple R-squared:  0.6237,  Adjusted R-squared:  0.6106 
# F-statistic: 47.73 on 5 and 144 DF,  p-value: < 2.2e-16

summary(m_log_2)
# Call:
# lm(formula = Sepal.Width ~ log2(Sepal.Length) * Species, data = iris)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.71398 -0.15310 -0.00419  0.16595  0.60237 
# 
# Coefficients:
#                                      Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                           -2.9663     0.8872  -3.343 0.001055 ** 
# log2(Sepal.Length)                     2.7560     0.3820   7.214 2.86e-11 ***
# Speciesversicolor                      2.3355     1.1899   1.963 0.051595 .  
# Speciesvirginica                       3.0464     1.1639   2.617 0.009807 ** 
# log2(Sepal.Length):Speciesversicolor  -1.4297     0.4913  -2.910 0.004186 ** 
# log2(Sepal.Length):Speciesvirginica   -1.6894     0.4721  -3.579 0.000471 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.272 on 144 degrees of freedom
# Multiple R-squared:  0.6237,  Adjusted R-squared:  0.6106 
# F-statistic: 47.73 on 5 and 144 DF,  p-value: < 2.2e-16

Comparing the summaries, you can convince yourself that the fits are the same: the residuals are the same, the t values, p-values, R-squared, and F-statistic are the same, and the intercepts are the same. The only differences are the coefficients (and their standard errors) for terms including Sepal.Length. We can divide the coefficients:

coef(m_log_e) / coef(m_log_2)
#                         (Intercept)                   log(Sepal.Length)                   Speciesversicolor                    Speciesvirginica 
#                            1.000000                            1.442695                            1.000000                            1.000000 
# log(Sepal.Length):Speciesversicolor  log(Sepal.Length):Speciesvirginica 
#                            1.442695                            1.442695

We see that the terms involving Sepal.Length are off by a fixed ratio. And what is that ratio?

1 / log(2)
# [1] 1.442695

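It is 1 / log(2), because of the change of base formula referenced at the start of this answer: log2(x) = log(x) / log(2), so each coefficient on log(x) equals the corresponding coefficient on log2(x) divided by log(2).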
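The same conclusion can be checked programmatically rather than by eye. The sketch below refits both models on the built-in iris data, compares the fitted values (which are what geom_smooth draws), and rescales the log2 coefficients by 1 / log(2). The helper names `same_fit`, `same_coef`, `coef_2`, and `is_slope` are just illustrative.

```r
# Refit both models so the snippet is self-contained
m_log_e <- lm(Sepal.Width ~ log(Sepal.Length) * Species, data = iris)
m_log_2 <- lm(Sepal.Width ~ log2(Sepal.Length) * Species, data = iris)

# Identical fitted values mean identical smooth lines in the plot
same_fit <- isTRUE(all.equal(fitted(m_log_e), fitted(m_log_2)))

# Divide the log2 terms by log(2) to recover the natural-log coefficients
coef_2 <- coef(m_log_2)
is_slope <- grepl("Sepal.Length", names(coef_2), fixed = TRUE)
coef_2[is_slope] <- coef_2[is_slope] / log(2)

# Term names differ (log2 vs log), so compare the values only
same_coef <- isTRUE(all.equal(unname(coef_2), unname(coef(m_log_e))))

print(c(same_fit = same_fit, same_coef = same_coef))
```

The comparison uses all.equal rather than identical, since the two fits may differ by floating-point rounding.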
