How to obtain results similar to the SpATS package using mgcv?

Question

I am trying to reproduce the results of the SpATS package using the mgcv package. More specifically, I am trying to obtain the adjusted means for the variable "gen" after correcting for the spatial trend. The variables "row" and "col" are the coordinates of the plots in a rectangular grid. The plots are placed in the field in 3 adjacent blocks represented by the variable "rep". Every "gen" is replicated at least once per block.

# wheatdata from SpATS
data("wheatdata", package = "SpATS")

dt <- dplyr::select(
  wheatdata,
  rep,
  row, col,
  gen = geno,
  yield
)

str(dt)

'data.frame':   330 obs. of  5 variables:
 $ rep  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ row  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ col  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ gen  : Factor w/ 107 levels "1","2","3","4",..: 4 10 17 16 21 32 33 34 72 74 ...
 $ yield: int  483 526 557 564 498 510 344 600 466 370 ...

The models

fit_psanova <- SpATS::SpATS(
  "yield",
  genotype = "gen",
  fixed = ~rep,
  spatial = ~ SpATS::PSANOVA(row, col),
  data = dt,
  control = SpATS::controlSpATS(monitoring = 0)
)

fit_sap <- SpATS::SpATS(
  "yield",
  genotype = "gen",
  fixed = ~rep,
  spatial = ~ SpATS::SAP(row, col),
  data = dt,
  control = SpATS::controlSpATS(monitoring = 0)
)

# SpATS::SAP is the same as SOP::sop from the package SOP

fit_sop <- SOP::sop(
  yield ~ rep + gen + f(row, col),
  data = dt
)

# the closest I could achieve using mgcv::gam

fit_gam <- mgcv::gam(
  yield ~
    rep +
    gen +
    s(row, bs = "cr", m = 2) +
    s(col, bs = "cr", m = 2, k = 14) +
    ti(row, col, bs = c("cr", "cr"), m = c(2, 2), k = c(10, 14)),
  data = dt,
  method = "REML"
)

Predictions

dt_sap <- predict(
  fit_sap,
  newdata = dt
) |>
  dplyr::summarise(
    sap = mean(predicted.values),
    sap_se = mean(standard.errors),
    .by = gen
  )

dt_psanova <- predict(
  fit_psanova,
  newdata = dt
) |>
  dplyr::summarise(
    psanova = mean(predicted.values),
    psanova_se = mean(standard.errors),
    .by = gen
  )

pred_sop <- predict(
  fit_sop,
  newdata = dt,
  se.fit = TRUE
)

dt_sop <- dplyr::tibble(
  sop = pred_sop$fit,
  sop_se = pred_sop$se.fit
) |>
  dplyr::bind_cols(dt) |>
  dplyr::summarise(
    sop = mean(sop),
    sop_se = mean(sop_se),
    .by = gen
  )

dt_margineff <- marginaleffects::predictions(
  fit_gam,
  newdata = dt
) |>
  dplyr::as_tibble() |>
  dplyr::summarise(
    margineff = mean(estimate),
    margineff_se = mean(std.error),
    .by = gen
  )

dt_preds <- list(
  dt_sap,
  dt_psanova,
  dt_sop,
  dt_margineff
) |>
  purrr::reduce(\(x, y) dplyr::inner_join(x, y, by = "gen")) |>
  dplyr::relocate(!dplyr::ends_with("se"))

For these plot-level predictions, the predicted values and standard errors from the four fits agree acceptably, as compared in the pair plots below.

lower_fun <- function(data, mapping) {
  ggplot2::ggplot(data = data, mapping = mapping) +
    ggplot2::geom_point() +
    ggplot2::geom_abline(
      color = "red",
      linetype = "dashed"
    )
}

GGally::ggpairs(
  dt_preds,
  columns = c("sap", "psanova", "sop", "margineff"),
  lower = list(
    continuous = lower_fun
  )
)

GGally::ggpairs(
  dt_preds,
  columns = c(
    "sap_se", "psanova_se", "sop_se", "margineff_se"
  ),
  lower = list(
    continuous = lower_fun
  )
)

Adjusted Means

?SpATS::predict.SpATS

Details

This function allows to produce predictions, either specifying: (1) the data frame on which to obtain the predictions (argument newdata), or (2) those variables that define the margins of the multiway table to be predicted (argument which).

In the first case, all fixed components (including genotype when fixed) and the spatial coordinates must be present in the data frame. As for the random effects is concerned, they are excluded from the predictions when the value is missing in the data frame.

In the second case, predictions are obtained for each combination of values of the specified variables that is present in the data set used to fit the model. For those variables not specified in the argument which, the following rules have been considered:

(a) random factors and the spatial trend are ignored in the predictions,
(b) for fixed numeric variables, the mean value is considered; and
(c) for fixed factors, there are two possibilities according to argument 'predFixed': (c1) if predFixed = 'conditional', the reference level is used; and (c2) predFixed = 'marginal', predictions are obtained averaging over all levels of the fixed factor.

Value

The data frame used for obtaining the predictions, jointly with the predicted values and the corresponding standard errors. The label “Excluded” has been used to indicate those cases where a covariate has been excluded or ignored for the prediction (as for instance the random effect).

References

Welham, S., Cullis, B., Gogel, B., Gilmour, A., and Thompson, R. (2004). Prediction in linear mixed models. Australian and New Zealand Journal of Statistics, 46, 325-347.

# My guess: predict on a grid like the one below,
# then average the predictions within each gen

dt_new <- expand.grid(
  gen = unique(dt$gen),
  rep = unique(dt$rep),
  row = mean(dt$row),
  col = mean(dt$col)
)
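
One way to make the mgcv predictions follow rule (a) of the help text above (spatial trend ignored) might be to drop the smooth terms before averaging over the grid. The sketch below is only an assumption about how to mirror that rule, not SpATS's own procedure. `gam_adjusted_means` is a hypothetical helper, and the output column names `gam_excl` and `gam_excl_se` are made up for illustration. It builds the linear-predictor matrix with predict(type = "lpmatrix"), zeroes the columns that belong to smooth terms, averages the rows within each gen (so every rep level gets equal weight, as with predFixed = "marginal"), and derives standard errors from vcov().

```r
# Hypothetical helper (not part of mgcv or SpATS): per-level means from a
# fitted mgcv::gam with all smooth terms dropped from the prediction.
gam_adjusted_means <- function(fit, grid, by = "gen") {
  # Linear-predictor matrix: one row per grid row, one column per coefficient
  Xp <- predict(fit, newdata = grid, type = "lpmatrix")

  # Zero the columns of every smooth term so the spatial trend is ignored
  for (sm in fit$smooth) {
    Xp[, sm$first.para:sm$last.para] <- 0
  }

  # Average the rows within each level of `by` (equal weight per grid row)
  groups <- droplevels(as.factor(grid[[by]]))
  A <- rowsum(Xp, groups) / as.vector(table(groups))

  # Estimates A %*% beta and standard errors sqrt(diag(A V A'))
  est <- drop(A %*% coef(fit))
  se <- sqrt(rowSums((A %*% vcov(fit)) * A))

  out <- data.frame(levels(groups), est, se, row.names = NULL)
  names(out) <- c(by, "gam_excl", "gam_excl_se")
  out
}
```

With the objects above this would be called as gam_adjusted_means(fit_gam, dt_new). Whether the result lines up with the SpATS adjusted means would still need to be checked; the sketch only removes one candidate source of disagreement.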

# adjusted means

dt_sap <- predict(
  fit_sap,
  which = "gen",
  predFixed = "marginal"
) |>
  dplyr::select(
    gen,
    sap = predicted.values,
    sap_se = standard.errors
  )

dt_psanova <- predict(
  fit_psanova,
  which = "gen",
  predFixed = "marginal"
) |>
  dplyr::select(
    gen,
    psanova = predicted.values,
    psanova_se = standard.errors
  )

dt_margineff <- marginaleffects::predictions(
  fit_gam,
  newdata = marginaleffects::datagrid(
    newdata = dt,
    grid_type = "balanced"
  ),
  by = "gen"
) |>
  dplyr::select(
    gen,
    margineff = estimate,
    margineff_se = std.error
  )

dt_emm <- emmeans::emmeans(fit_gam, "gen") |>
  dplyr::as_tibble() |>
  dplyr::select(
    gen,
    emm = emmean,
    emm_se = SE
  )

# margin = "marginalmeans" relies on emmeans::emmeans() internally
dt_ggeff <- ggeffects::predict_response(
  fit_gam,
  "gen",
  margin = "marginalmeans"
) |>
  dplyr::as_tibble() |>
  dplyr::select(
    gen = x,
    ggeff = predicted,
    ggeff_se = std.error
  )

dt_preds <- list(
  dt_sap,
  dt_psanova,
  dt_emm,
  dt_margineff,
  dt_ggeff
) |>
  purrr::reduce(\(x, y) dplyr::inner_join(x, y, by = "gen")) |>
  dplyr::relocate(!dplyr::ends_with("se"))

For the adjusted means, however, the results now seem to be shifted by some amount between methods.

GGally::ggpairs(
  dt_preds,
  columns = c("sap", "psanova", "margineff", "emm", "ggeff"),
  lower = list(
    continuous = lower_fun
  )
)

GGally::ggpairs(
  dt_preds,
  columns = c(
    "sap_se", "psanova_se", "margineff_se", "emm_se", "ggeff_se"
  ),
  lower = list(
    continuous = lower_fun
  )
)
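
To put a number on the shift rather than judge it from the plots, one option is to summarise the per-genotype differences against a reference column. A standard deviation of the differences close to zero would suggest a constant offset (for example, a different reference point for the intercept or the spatial trend) rather than a disagreement about the genotype ranking. `shift_summary` is a hypothetical helper written for illustration; it assumes a data frame shaped like dt_preds, with one numeric column per method.

```r
# Hypothetical helper: offset of each method's column from a reference column
shift_summary <- function(preds, reference, others) {
  # Per-row differences (method minus reference), one list element per method
  diffs <- lapply(preds[others], function(x) x - preds[[reference]])

  data.frame(
    method = others,
    mean_shift = vapply(diffs, mean, numeric(1)),    # average offset
    sd_shift = vapply(diffs, stats::sd, numeric(1)), # ~0 means constant offset
    row.names = NULL
  )
}
```

For the table above this might be called as shift_summary(dt_preds, "sap", c("psanova", "margineff", "emm", "ggeff")).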

My questions are:

Q1 - Are the models actually very different?

Q2 - Is my prediction process too naive?

Q3 - Both?
