Lili

Reputation: 587

BestNormalize gives misleading results?

I am wondering why I get different results every time I run this code:

library(bestNormalize)

# arcsinh transformation
(arcsinh_obj <- arcsinh_x(df$Var1))
# Box Cox's Transformation
(boxcox_obj <- boxcox(df$Var1))
# Yeo-Johnson's Transformation
(yeojohnson_obj <- yeojohnson(df$Var1))
# orderNorm Transformation
(orderNorm_obj <- orderNorm(df$Var1))
# Pick the best one automatically
(BNobject <- bestNormalize(df$Var1))
# Last resort - binarize
(binarize_obj <- binarize(df$Var1))

summary(df$Var1)
xx <- seq(min(12), max(56), length = 295)

plot(xx, predict(arcsinh_obj, newdata = xx), type = "l", col = 1, ylim = c(-4, 4),
     xlab = 'df$Var1', ylab = "g(df$Var1)")
lines(xx, predict(boxcox_obj, newdata = xx), col = 2)
lines(xx, predict(yeojohnson_obj, newdata = xx), col = 3)
lines(xx, predict(orderNorm_obj, newdata = xx), col = 4)

legend("bottomright", legend = c("arcsinh", "Box Cox", "Yeo-Johnson", "OrderNorm"), 
       col = 1:4, lty = 1, bty = 'n')

par(mfrow = c(2,2))
MASS::truehist(arcsinh_obj$x.t, main = "Arcsinh transformation", nbins = 100)
MASS::truehist(boxcox_obj$x.t, main = "Box Cox transformation", nbins = 100)
MASS::truehist(yeojohnson_obj$x.t, main = "Yeo-Johnson transformation", nbins = 100)
MASS::truehist(orderNorm_obj$x.t, main = "orderNorm transformation", nbins = 100)

par(mfrow = c(1,2))
MASS::truehist(BNobject$x.t, 
               main = paste("Best Transformation:", 
                            class(BNobject$chosen_transform)[1]), nbins = 100)
plot(xx, predict(BNobject, newdata = xx), type = "l", col = 1, 
     main = "Best Normalizing transformation", ylab = "g(x)", xlab = "x")

dev.off()
boxplot(log10(BNobject$oos_preds), yaxt = 'n')
axis(2, at=log10(c(.1,.5, 1, 2, 5, 10)), labels=c(.1,.5, 1, 2, 5, 10))

I even tried this every time before re-running the analysis, in case it was actually affecting the results:

 rm(list = ls())

Could you please give me a hand?

Thanks, Lili

Upvotes: 1

Views: 487

Answers (1)

petersonR

Reputation: 49

You are likely getting different results because the bestNormalize() function uses repeated cross-validation (and doesn't automatically set the seed), so the results will be slightly different each time it is run.
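For example (a quick check, assuming df$Var1 is the vector from your question), two back-to-back calls without a seed can pick different transformations:

library(bestNormalize)

# run the selection twice with no seed: the random CV folds differ,
# so the chosen transformation can differ between runs
two_picks <- replicate(2, class(bestNormalize(df$Var1)$chosen_transform)[1])
two_picks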

Try setting the seed (e.g. set.seed(3)).
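A minimal sketch of that fix, reusing the objects from your question:

set.seed(3)                          # fix the RNG state before the repeated CV
BNobject <- bestNormalize(df$Var1)   # the same transformation is now chosen on every run
class(BNobject$chosen_transform)[1]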

Or, you can tell the function not to perform repeated CV by setting out_of_sample = FALSE, or use leave-one-out CV by setting loo = TRUE.
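Again as a sketch with the same data, either option removes the run-to-run randomness:

# in-sample normality statistics only (no repeated CV)
BN_insample <- bestNormalize(df$Var1, out_of_sample = FALSE)

# leave-one-out CV (slower, but no random fold assignment)
BN_loo <- bestNormalize(df$Var1, loo = TRUE)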

Upvotes: 1
