user19304348
user19304348

Reputation: 43

regression lm on R qualitative and quantitative variables

I am trying to do a linear regression on R, with qualitative and quantitative variables. Here is the structure of dataset :

trip                             X                 year

A 10                          500                  2019
B 11                          600                  2019
C 9                           450                  2020

I want to do a regression of "A" (y, variable to explain) on the other variables (explicative variables). The problem being that with a qualitative variable like "spécialité", it then treats other variables like qualitative variables to, and thus does a regression separately on year 2019, year 2020,...

lm_amb <- lm(reg_spé_amb[,3] ~ reg_spé_amb[,2] + reg_spé_amb[,1] + reg_spé_amb[,4])
summary(lm_amb)

I thus obtain a result in the following form :

                                                 Estimate            std error
                                                          
intercept                                         400                   26
reg_spé_amb[, 2]10                                 88                   66
reg_spé_amb[, 2]11                                 64                   10
reg_spé_amb[, 1]A                                  70                   80
reg_spé_amb[, 4]2019                               80                   90

I would like to get a coefficient per A, one for the variable "year" as a whole and not as separate variables for each year, and one for the ("A"). Could anyone help me with that ?

Upvotes: 2

Views: 1041

Answers (1)

dipetkov
dipetkov

Reputation: 3700

Your question suggests you are just starting with both statistics and R. You'll benefit from first learning the basics and there are many resources available online. Take a look for example at Modern Statistics with R.

With that advice out of the way, let's look at your questions.

You provided just three rows of data which is not enough to estimate the parameters of the proposed regression. So I add three rows of fake data.

library("tidyverse")

reg_spé_amb <-
  tibble::tribble(
    ~spécialité, ~`nombre de praticiens`, ~`activité en valeur`, ~année,
    "anesthésie", 10L, 500L, 2019L,
    "chirurgie", 11L, 600L, 2019L,
    "anesthésie", 9L, 450L, 2020L,
    "chirurgie", 9L, 550L, 2019L,
    "chirurgie", 10L, 450L, 2019L,
    "anesthésie", 10L, 650L, 2020L
  )

It's convenient to fit a model to the data using the formula interface. R keeps track of lots of information for us, including the variable names, so there is no need for awkward specifications such as reg_spé_amb[,1].

lm_amb <- lm(
  formula = `activité en valeur` ~ `nombre de praticiens` + spécialité + année,
  data = reg_spé_amb
)

After we fit the model, we validate it. This step is crucial because what's the point in extracting coefficients from a model that's a poor fit to the data? See 8.1.4 Model diagnostics in "Modern Statistics with R".

[Lots of work to validate the model...]

Once we are satisfied with the model, we look at the coefficients and various other statistics.

Here is how we can get the coefficient for the linear effect of year (année) and number of doctors (nombre de praticiens) on the response (activité en valeur).

summary(lm_amb)
#> 
#> Call:
#> lm(formula = `activité en valeur` ~ `nombre de praticiens` + 
#>     spécialité + année, data = reg_spé_amb)
#> 
#> Residuals:
#>          1          2          3          4          5          6 
#> -3.197e-14  6.667e+00 -7.000e+01  7.667e+01 -8.333e+01  7.000e+01 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value Pr(>|t|)
#> (Intercept)            -161620.00  272131.90  -0.594    0.613
#> `nombre de praticiens`      60.00      67.33   0.891    0.467
#> spécialitéchirurgie         33.33     122.93   0.271    0.812
#> année                       80.00     134.66   0.594    0.613
#> 
#> Residual standard error: 106.5 on 2 degrees of freedom
#> Multiple R-squared:   0.32,  Adjusted R-squared:   -0.7 
#> F-statistic: 0.3137 on 3 and 2 DF,  p-value: 0.819

Or more succinctly and with better table formatting:

broom::tidy(lm_amb)
#> # A tibble: 4 × 5
#>   term                    estimate std.error statistic p.value
#>   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)            -161620.   272132.     -0.594   0.613
#> 2 `nombre de praticiens`      60        67.3     0.891   0.467
#> 3 spécialitéchirurgie         33.3     123.      0.271   0.812
#> 4 année                       80.0     135.      0.594   0.613

It's just as easy to get one coefficient (marginal effect) for each level of the categorical variable specialty (spécialité).

emmeans::emmeans(lm_amb, ~ spécialité)
#>  spécialité emmean   SE df lower.CL upper.CL
#>  anesthésie    530 65.4  2      248      812
#>  chirurgie     563 89.8  2      177      950
#> 
#> Results are averaged over the levels of: année 
#> Confidence level used: 0.95

As a bonus, here is how to compare the marginal effects of the two specialties. Note that comparison anesthésie - chirurgie is what we get from the summary table above as the coefficient spécialitéchirurgie. The sign is flipped because spécialitéchirurgie is actually the comparison chirurgie - anesthésie.

emmeans::emmeans(lm_amb, pairwise ~ spécialité)
#> $contrasts
#>  contrast               estimate  SE df t.ratio p.value
#>  anesthésie - chirurgie    -33.3 123  2  -0.271  0.8117
#> 
#> Results are averaged over the levels of: année

Understanding all of the above will help you to understand how to fit and interpret a regression model.

Created on 2022-06-25 by the reprex package (v2.0.1)

Upvotes: 2

Related Questions