Reputation: 43
I am trying to do a linear regression on R, with qualitative and quantitative variables. Here is the structure of dataset :
trip X year
A 10 500 2019
B 11 600 2019
C 9 450 2020
I want to do a regression of "A" (y, variable to explain) on the other variables (explicative variables). The problem being that with a qualitative variable like "spécialité", it then treats other variables like qualitative variables to, and thus does a regression separately on year 2019, year 2020,...
lm_amb <- lm(reg_spé_amb[,3] ~ reg_spé_amb[,2] + reg_spé_amb[,1] + reg_spé_amb[,4])
summary(lm_amb)
I thus obtain a result in the following form :
Estimate std error
intercept 400 26
reg_spé_amb[, 2]10 88 66
reg_spé_amb[, 2]11 64 10
reg_spé_amb[, 1]A 70 80
reg_spé_amb[, 4]2019 80 90
I would like to get a coefficient per A, one for the variable "year" as a whole and not as separate variables for each year, and one for the ("A"). Could anyone help me with that ?
Upvotes: 2
Views: 1041
Reputation: 3700
Your question suggests you are just starting with both statistics and R. You'll benefit from first learning the basics and there are many resources available online. Take a look for example at Modern Statistics with R.
With that advice out of the way, let's look at your questions.
You provided just three rows of data which is not enough to estimate the parameters of the proposed regression. So I add three rows of fake data.
library("tidyverse")
reg_spé_amb <-
tibble::tribble(
~spécialité, ~`nombre de praticiens`, ~`activité en valeur`, ~année,
"anesthésie", 10L, 500L, 2019L,
"chirurgie", 11L, 600L, 2019L,
"anesthésie", 9L, 450L, 2020L,
"chirurgie", 9L, 550L, 2019L,
"chirurgie", 10L, 450L, 2019L,
"anesthésie", 10L, 650L, 2020L
)
It's convenient to fit a model to the data using the formula interface. R keeps track of lots of information for us, including the variable names, so there is no need for awkward specifications such as reg_spé_amb[,1]
.
lm_amb <- lm(
formula = `activité en valeur` ~ `nombre de praticiens` + spécialité + année,
data = reg_spé_amb
)
After we fit the model, we validate it. This step is crucial because what's the point in extracting coefficients from a model that's a poor fit to the data? See 8.1.4 Model diagnostics in "Modern Statistics with R".
[Lots of work to validate the model...]
Once we are satisfied with the model, we look at the coefficients and various other statistics.
Here is how we can get the coefficient for the linear effect of year (année) and number of doctors (nombre de praticiens) on the response (activité en valeur).
summary(lm_amb)
#>
#> Call:
#> lm(formula = `activité en valeur` ~ `nombre de praticiens` +
#> spécialité + année, data = reg_spé_amb)
#>
#> Residuals:
#> 1 2 3 4 5 6
#> -3.197e-14 6.667e+00 -7.000e+01 7.667e+01 -8.333e+01 7.000e+01
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -161620.00 272131.90 -0.594 0.613
#> `nombre de praticiens` 60.00 67.33 0.891 0.467
#> spécialitéchirurgie 33.33 122.93 0.271 0.812
#> année 80.00 134.66 0.594 0.613
#>
#> Residual standard error: 106.5 on 2 degrees of freedom
#> Multiple R-squared: 0.32, Adjusted R-squared: -0.7
#> F-statistic: 0.3137 on 3 and 2 DF, p-value: 0.819
Or more succinctly and with better table formatting:
broom::tidy(lm_amb)
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -161620. 272132. -0.594 0.613
#> 2 `nombre de praticiens` 60 67.3 0.891 0.467
#> 3 spécialitéchirurgie 33.3 123. 0.271 0.812
#> 4 année 80.0 135. 0.594 0.613
It's just as easy to get one coefficient (marginal effect) for each level of the categorical variable specialty (spécialité).
emmeans::emmeans(lm_amb, ~ spécialité)
#> spécialité emmean SE df lower.CL upper.CL
#> anesthésie 530 65.4 2 248 812
#> chirurgie 563 89.8 2 177 950
#>
#> Results are averaged over the levels of: année
#> Confidence level used: 0.95
As a bonus, here is how to compare the marginal effects of the two specialties. Note that comparison anesthésie - chirurgie is what we get from the summary table above as the coefficient spécialitéchirurgie. The sign is flipped because spécialitéchirurgie is actually the comparison chirurgie - anesthésie.
emmeans::emmeans(lm_amb, pairwise ~ spécialité)
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> anesthésie - chirurgie -33.3 123 2 -0.271 0.8117
#>
#> Results are averaged over the levels of: année
Understanding all of the above will help you to understand how to fit and interpret a regression model.
Created on 2022-06-25 by the reprex package (v2.0.1)
Upvotes: 2