Reputation: 697
I have the following data set containing socioeconomic variables:
> glimpse(df)
Rows: 730,099
Columns: 9
$ id <int> 25500, 25501, 25502, 25503, 25504, 25505, 25506, 25507, 25508, 25509, 25510, 25511, 25512, 25513, 25514, 25515, 25516, 255…
$ Prov <fct> Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al…
$ Age <dbl> 45, 15, 65, 55, 35, 75, 45, 25, 55, 40, 35, 50, 70, 20, 30, 75, 50, 35, 50, 40, 35, 70, 70, 35, 35, 75, 55, 35, 25, 50, 30…
$ Edu <dbl> 5, 4, 0, 0, 0, 0, 0, 16, 0, 5, 4, 0, 0, 6, 0, 0, 0, 0, 0, 0, 9, 0, 0, 14, 4, 0, 0, 3, 5, 0, 9, 0, 0, 0, 15, 0, 0, 0, 14, 3…
$ Kids <int> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 4, 1, 0, 2, 0, 3, 2, 2, 1, 0, 5, 0, 2, 0, 0, 0, 1, 0, 2, 2, 1, 0, 5, 0, 0, 3, 0, 0, 1, 0,…
$ Mil <fct> Urban, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Urban, Urban, Rural, Urban, Urban, Rur…
$ DSize <dbl+lbl> 2, 1, 4, 4, 1, 1, 3, 2, 3, 4, 1, 6, 3, 2, 5, 3, 4, 2, 4, 2, 3, 6, 3, 2, 4, 4, 3, 2, 2, 4, 1, 4, 6, 4, 5, 5, 4, 2, 5, 3…
$ taille <dbl> 2, 5, 5, 7, 5, 1, 2, 1, 4, 1, 5, 9, 5, 1, 5, 3, 7, 4, 6, 7, 2, 14, 3, 4, 1, 2, 9, 3, 3, 12, 4, 7, 2, 9, 2, 6, 5, 2, 4, 3, …
$ EP <fct> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,…
The variable Prov
indicates the 72 provinces of the country, it looks like this:
> levels(df$Prov)
[1] "Agadir-Ida-Ou-Tanane" "Al Haouz" "Al Hoceïma" "Meknès" "Azilal"
[6] "Béni Mellal" "Benslimane" "Berkane" "Berrechid" "Boujdour"
[11] "Boulemane" "Casablanca" "Chefchaouen" "Chichaoua" "Chtouka-Ait Baha"
[16] "Driouch" "El Hajeb" "El Jadida" "El Kelâa des Sraghna" "Errachidia"
[21] "Essaouira" "Fahs-Anjra" "Fès" "Figuig" "Fquih Ben Salah"
[26] "Guelmim" "Guercif" "Ifrane" "Inezgane-Ait Melloul" "Jerada"
[31] "Kénitra" "Khémisset" "Khénifra" "Khouribga" "Laâyoune"
[36] "Larache" "Marrakech" "Médiouna" "Midelt" "Mohammadia"
[41] "Nador" "Nouaceur" "Ouarzazate" "Ouezzane" "Oujda-Angad"
[46] "Rabat" "Rehamna" "Safi" "Salé" "Sefrou"
[51] "Settat" "Sidi Bennour" "Sidi Ifni" "Sidi Kacem" "Sidi Slimane"
[56] "Skhirate-Témara" "Tanger-Assilah" "Taounate" "Taourirt" "Taroudannt"
[61] "Tata" "Taza" "Tétouan" "M'Diq-Fnideq" "Tinghir"
[66] "Tiznit" "Youssoufia" "Zagora" "Moulay Yacoub" "Tan-Tan / Assa-Zag"
[71] "Es-Semara / Tarfaya" "Oued Ed Dahab / Aousserd"
and I have another shapefile containing the location polygons for each province, it looks like this:
> map_3
Simple feature collection with 72 features and 1 field
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -17.10496 ymin: 20.7715 xmax: -0.9987581 ymax: 35.92243
Geodetic CRS: WGS 84
# A tibble: 72 × 2
PROV geometry
* <chr> <POLYGON [°]>
1 Agadir-Ida-Ou-Tanane ((-9.504189 30.3498, -9.503555 30.34989, -9.503493 30.34991, -9.502746 30.35016, -9.50251...
2 Al Haouz ((-7.344661 31.67321, -7.344881 31.67344, -7.345 31.6736, -7.346125 31.67503, -7.346162 3...
3 Al Hoceïma ((-3.926041 35.26096, -3.926289 35.26113, -3.926532 35.26142, -3.926776 35.26171, -3.9270...
4 Azilal ((-5.908287 32.30598, -5.908415 32.3063, -5.913961 32.32005, -5.915668 32.32325, -5.92306...
5 Benslimane ((-7.195987 33.18768, -7.191716 33.18865, -7.169873 33.19075, -7.164904 33.19066, -7.1557...
6 Berkane ((-2.313083 34.58991, -2.312776 34.58996, -2.312381 34.59001, -2.308861 34.59052, -2.3085...
7 Berrechid ((-7.250729 33.21856, -7.249378 33.22028, -7.246216 33.22429, -7.245882 33.22472, -7.2454...
8 Boujdour ((-12.54495 24.46466, -12.49897 24.47041, -12.4321 24.48361, -12.38604 24.49452, -12.3777...
9 Boulemane ((-4.202123 32.58081, -4.200447 32.58095, -4.183846 32.58234, -4.17291 32.58488, -4.15931...
10 Casablanca ((-7.669198 33.49706, -7.668895 33.49699, -7.66875 33.49705, -7.668693 33.49707, -7.66854...
# ℹ 62 more rows
# ℹ Use `print(n = ...)` to see more rows
When I want to run a simple logistic regression I simply run this command and it gives me the results I want:
> model <- glm(EP~Age+Edu+Kids+Mil+DSize+taille,family = binomial(link = "logit"), data = df)
> summary(model)
Call:
glm(formula = EP ~ Age + Edu + Kids + Mil + DSize + taille, family = binomial(link = "logit"),
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0040682 0.0165088 -0.246 0.805
Age -0.0060011 0.0002926 -20.511 <2e-16 ***
Edu -0.1165767 0.0010738 -108.562 <2e-16 ***
Kids 0.0927567 0.0038183 24.293 <2e-16 ***
MilRural 1.5594703 0.0078742 198.048 <2e-16 ***
DSize -0.4697581 0.0034187 -137.409 <2e-16 ***
taille -0.1368148 0.0026621 -51.393 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 613335 on 719910 degrees of freedom
Residual deviance: 497842 on 719904 degrees of freedom
(10188 observations deleted due to missingness)
AIC: 497856
Number of Fisher Scoring iterations: 6
Now I want to run a Geographically Weighted Logistic Regression, and for that I checked the GWModel
package manual, and found the function ggwr.basic
, which has an option for logistic regression.
DM<-gw.dist(dp.locat=coordinates(londonhp))
bw.f2 <- bw.ggwr(BATH2~FLOORSZ,data=londonhp, dMat=DM,family ="binomial")
res.binomial<-ggwr.basic(BATH2~FLOORSZ, bw=bw.f2,data=londonhp, dMat=DM,
family ="binomial")
But in their example, the location variable is a point (X Y coordinates), whereas in my case the locations are polygons, also they have one observation per location (they use the LondonHP
data set), whereas I have more than 730,000 observations scattered across 72 provinces.
Therefore, how can I run a Geographically Weighted Logistic Regression model on my data? I just gave the GWModel
package as an example; if you know a different package or function that can get the job done, please feel free to use it.
Upvotes: 0
Views: 394
Reputation: 6921
I guess the trick is to convert your data to a spatial format {GWmodel} can process. Please adapt below example to suit your purpose:
sf
-class object as I find simple features more convenient to work with.library(dplyr)
library(rnaturalearth)
provinces <- ne_states(country = 'Morocco', returnclass = 'sf')
## keep only name and geometry:
provinces <- provinces |> select(name)
## > provinces
## Simple feature collection with 16 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -17.01374 ymin: 21.41997 xmax: -1.031999 ymax: 35.92652
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## First 10 features:
## name geometry
## 58 Guelmim - Es-Semara POLYGON ((-8.817035 27.6614...
## 60 Laâyoune - Boujdour - Sakia El Hamra POLYGON ((-12.05582 25.9958...
provinces <-
provinces |>
mutate(geometry = st_centroid(geometry),
x1 = rnorm(16), # predictor 1
x2 = rnorm(16), # predictor 2
Y = rbinom(16, 1, .5) # binomial outcome
)
sf
to class sp
, which is required by {GWmodel} functions:provinces <- as_Spatial(provinces)
the_model <- ggwr.basic(Y ~ x1 + x2,
data = provinces,
bw = 50, ## I just put a fixed bandwith here
family ="binomial"
)
## > summary(the_model)
## Length Class Mode
## GW.arguments 11 -none- list
## GW.diagnostic 4 -none- list
## glms 22 -none- list
## SDF 16 SpatialPointsDataFrame S4
## CV 16 -none- numeric
## timings 2 -none- list
## this.call 5 -none- call
Upvotes: 1