Finding the best predictors with a GAM and getting the deviance explained by each predictor

Question

I need to determine the influence of soil properties (predictors) on soluble heavy metals (response variables) and quantify the proportion of deviance explained by each predictor.

The dataset contains 2000 points. The response variable is continuous and positive, but strongly right-skewed: most values are very low (though not necessarily zero) and a few are very high.
The predictors are continuous and positive.

I am new to Stack Overflow and to this statistical approach, so I apologize in advance if my questions miss some basics.

A simple Spearman correlation gives information about the potential association between each predictor and the response variable. However, is it true that the results from a GAM or GLM are more robust than those from a Spearman correlation?
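
For reference, the Spearman screening mentioned above can be sketched as follows. This is only an illustration on simulated stand-in data: the data frame `dat`, its effect sizes, and the predictor list are assumptions, not the real data set. Spearman's rho measures monotonic association only, so a hump-shaped relationship could score close to zero here.

```r
# Simulated stand-in data (hypothetical; NOT the real data set)
set.seed(1)
n <- 300
dat <- data.frame(PJRC = runif(n, 1, 10), clay = runif(n, 5, 60))
# Right-skewed, strictly positive response that increases with PJRC only
dat$ZnmgKg <- rgamma(n, shape = 2, rate = 2 / exp(0.2 * dat$PJRC))

predictors <- c("PJRC", "clay")

# Spearman's rho between each predictor and the response
rho <- sapply(predictors, function(p) {
  cor(dat[[p]], dat$ZnmgKg, method = "spearman")
})
print(round(rho, 3))
```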

I attempted to use GAM with a Tweedie distribution. I'm wondering if I can extract the deviance explained by each predictor by simply running the GAM with a single predictor as follows:

> model <- gam(ZnmgKg ~ s(PJRC, k=10), data = lucas.dat,  method = 'REML', family= tw(link = "log"))

> summary(model)

Family: Tweedie(p=1.99) 
Link function: log 

Formula:
ZnmgKg ~ s(PJRC, k = 10)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.4782     0.0139   34.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value    
s(PJRC) 4.795  5.857 179.3  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.244   **Deviance explained = 31.5%**
-REML = 3145.5  Scale est. = 0.45613   n = 2348

> model <- gam(ZnmgKg ~ s(clay, k=10), data = lucas.dat,  method = 'REML', family= tw(link = "log"))

> summary(model)

Family: Tweedie(p=1.99) 
Link function: log 

Formula:
ZnmgKg ~ s(clay, k = 10)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.57645    0.01622   35.55   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value    
s(clay) 1.512  1.872 62.03  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0249   **Deviance explained = 4.38%**
-REML =   3556  Scale est. = 0.61891   n = 2340
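
The two single-predictor fits above could be produced in a loop, reading the deviance explained from `summary()$dev.expl`. The sketch below runs on simulated stand-in data (the data frame `dat` and its effect sizes are assumptions, not the real data); it only shows the mechanics and does not settle whether marginal values like these are the right quantity to report.

```r
library(mgcv)

# Simulated stand-in data (hypothetical; NOT the real data set)
set.seed(1)
n <- 300
dat <- data.frame(PJRC = runif(n, 1, 10), clay = runif(n, 5, 60))
mu <- exp(0.2 + 0.15 * dat$PJRC + 0.005 * dat$clay)
dat$ZnmgKg <- rgamma(n, shape = 2, rate = 2 / mu)  # right-skewed, positive

predictors <- c("PJRC", "clay")

# Fit one single-smooth Tweedie GAM per predictor and keep its
# proportion of deviance explained
dev_expl <- sapply(predictors, function(p) {
  f <- reformulate(sprintf("s(%s, k = 10)", p), response = "ZnmgKg")
  m <- gam(f, data = dat, method = "REML", family = tw(link = "log"))
  summary(m)$dev.expl
})
print(round(100 * dev_expl, 1))  # percentages, one per predictor
```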

Is it statistically correct if I simply report in the manuscript that 31.5% of the deviance in ZnmgKg is explained by PJRC and 4.38% by clay, and similarly for the rest of the predictors?
I ran gam.check(model, rep=1000): the tails in the "theoretical quantiles" (QQ) plot fit poorly, and the points in the "Response vs. Fitted Values" plot do not follow the 1:1 line (see the figure below for clay). Does this indicate that the model fits the data poorly? Does it matter when my aim is not prediction but understanding the importance of each predictor?

[Figure: gam.check(model, rep=1000) diagnostic plots for the clay model]

While there is an option to include all predictors in the model, it does not provide separate deviance values for each predictor (or I do not know how to extract them!). Is there a proper way to identify how much of the deviance is explained by each predictor?

> model <- gam(ZnmgKg ~ Plant + s(clay, k = 10) + s(pH_CaCl2, k = 10) + s(OC, k = 10) + s(PJRC, k = 10), data = lucas.dat,  method = 'REML', family= tw(link = "log"))
> #EVALUATING THE FITTED MODEL
> summary(model)

Family: Tweedie(p=1.99) 
Link function: log 

Formula:
ZnmgKg ~ Plant + s(clay, k = 10) + s(pH_CaCl2, k = 10) + s(OC, 
    k = 10) + s(PJRC, k = 10)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.342545   0.023584  14.524  < 2e-16 ***
PlantB20     0.136277   0.059755   2.281 0.022663 *  
PlantB30    -0.001029   0.051469  -0.020 0.984058    
PlantB40     0.198377   0.064552   3.073 0.002143 ** 
PlantB50     0.126916   0.051508   2.464 0.013812 *  
PlantB70     0.430686   0.060898   7.072 2.01e-12 ***
PlantB80     0.202794   0.067635   2.998 0.002743 ** 
PlantE00     0.137403   0.038144   3.602 0.000322 ***
PlantF40    -0.102245   0.066300  -1.542 0.123172    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
              edf Ref.df       F p-value    
s(clay)     5.962  7.041   8.748  <2e-16 ***
s(pH_CaCl2) 7.244  8.108  39.145  <2e-16 ***
s(OC)       5.870  7.023   9.410  <2e-16 ***
s(PJRC)     5.093  6.194 117.888  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.296   **Deviance explained = 44.1%**
-REML = 2942.1  Scale est. = 0.37912   n = 2340
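
One approach sometimes suggested for a multi-predictor GAM is to refit the model with each smooth dropped in turn, holding the remaining smoothing parameters at their full-model values, and to express the increase in deviance as a share of the null deviance. The sketch below is an illustration under stated assumptions, not a definitive method: the data are simulated stand-ins, the `Plant` factor and `pH_CaCl2` are left out for brevity, and the Tweedie power is fixed at 1.99 (the value reported in the summaries above) via `Tweedie()` so that all deviances are on the same scale. With correlated predictors these shares need not add up to the total deviance explained.

```r
library(mgcv)

# Simulated stand-in data (hypothetical; NOT the real data set)
set.seed(2)
n <- 400
dat <- data.frame(clay = runif(n, 5, 60),
                  OC   = runif(n, 1, 50),
                  PJRC = runif(n, 1, 10))
mu <- exp(0.2 + 0.15 * dat$PJRC + 0.01 * dat$OC + 0.005 * dat$clay)
dat$ZnmgKg <- rgamma(n, shape = 2, rate = 2 / mu)

# Assumption: Tweedie power fixed at 1.99 so deviances are comparable
fam <- Tweedie(p = 1.99, link = "log")

vars    <- c("clay", "OC", "PJRC")
smooths <- sprintf("s(%s, k = 10)", vars)

full <- gam(reformulate(smooths, response = "ZnmgKg"),
            data = dat, method = "REML", family = fam)

# Drop each smooth in turn; reuse the full model's smoothing parameters
# for the remaining smooths (same order as in the formula)
drop_contrib <- sapply(seq_along(smooths), function(i) {
  reduced <- gam(reformulate(smooths[-i], response = "ZnmgKg"),
                 data = dat, method = "REML", family = fam,
                 sp = full$sp[-i])
  (deviance(reduced) - deviance(full)) / full$null.deviance
})
names(drop_contrib) <- vars
print(round(100 * drop_contrib, 1))  # % of null deviance lost per dropped term
```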

Answers (0)
