Reputation: 1263
I am familiar with QGIS but struggling with R here, and I'd like some help to do a basic geographically weighted regression with some data that is based on the centroid points of New York City's PUMA shapefile (55 points, one for each PUMA, which is basically like a big census tract).
This is my data in csv: https://www.sendspace.com/file/pj48b5
Also, if necessary (probably not), here is the shapefile data: https://www.sendspace.com/file/wbqrpb
As you can see, the table is in the following format:
lat lng variable_a 2015_median 9_yr_change 9_yr_change_new pc_change
40.8912378 -73.9101365 6 1200 380 480 31.6666666667
40.8901905 -73.8614272 8 1100 280 200 25.4545454545
40.8502191 -73.8050669 11 1100 300 530 27.2727272727
40.8561725 -73.8525618 2 1100 320 205 29.0909090909
If I do the basic regression of two variables, as so:
fit <- lm(variable_a ~ X9_yr_change_new, data=s_data)
summary(fit)
Then I get an R squared of 0.42
What I want to do next is to test the same two variables, but by using the lat and lng variables (co-ordinates of the centroids) to see if there's a stronger relationship when the geographical proximity of these points is taken into account.
Can anyone tell me the easiest way to do this either in QGIS or in R?
Upvotes: 0
Views: 887
Reputation: 12723
For weighted regression, you have to first find the weights based on location. It can be done by averaging the variable_a
response for every group of lat/lng
, and count the number of responses in each group. This number
will become the weights for the average response of ave_var_a
. Then conduct weighted regression by passing weights = number
to the lm
function.
Since your data has only one response per location, the fitted results of both unweighted and weighted regression are same. It can be seen using summary.aov()
function.
I am showing both unweighted and weighted regression below.
Setting up weighted data:
df1 <- read.table(file = 's_data.csv', header = TRUE, sep = ',', stringsAsFactors = FALSE )
head(df1)
# lat lng variable_a X2015_median X9_yr_change X9_yr_change_new pc_change
# 1 40.89124 -73.91014 6 1200 380 480 31.66667
# 2 40.89019 -73.86143 8 1100 280 200 25.45455
# 3 40.85022 -73.80507 11 1100 300 530 27.27273
# 4 40.85617 -73.85256 2 1100 320 205 29.09091
# 5 40.84518 -73.88736 21 850 260 250 30.58824
# 6 40.86465 -73.90325 2 1000 230 300 23.00000
library(data.table)
setDT(df1)
df1[,
j = `:=` (number = .N, # total number of responses per location
ave_var_a = mean(variable_a)), # average response per location
by = c('lat', 'lng')]
head(df1)
# lat lng variable_a X2015_median X9_yr_change X9_yr_change_new pc_change number ave_var_a
# 1: 40.89124 -73.91014 6 1200 380 480 31.66667 1 6
# 2: 40.89019 -73.86143 8 1100 280 200 25.45455 1 8
# 3: 40.85022 -73.80507 11 1100 300 530 27.27273 1 11
# 4: 40.85617 -73.85256 2 1100 320 205 29.09091 1 2
# 5: 40.84518 -73.88736 21 850 260 250 30.58824 1 21
# 6: 40.86465 -73.90325 2 1000 230 300 23.00000 1 2
Perform lm
unweighted and weighted regression:
# unweighted regression
fit <- lm( variable_a ~ X9_yr_change_new, data= df1 )
summary.aov(fit)
# Df Sum Sq Mean Sq F value Pr(>F)
# X9_yr_change_new 1 6537830 6537830 39.23 6.89e-08 ***
# Residuals 53 8833855 166677
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# weighted regression
weighted_fit <- lm( ave_var_a ~ X9_yr_change_new, data= df1, weights = number )
summary.aov(weighted_fit)
# Df Sum Sq Mean Sq F value Pr(>F)
# X9_yr_change_new 1 6537830 6537830 39.23 6.89e-08 ***
# Residuals 53 8833855 166677
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Upvotes: 2