Reputation: 312
I'm trying to replicate the work done in the really cool nearest neighbor question but do it for each area in my dataframe rather than the group as a whole.
My data ncbaby (don't ask) looks like this:
id printid areaname latitude longitude
1 7912048 233502729 073 36.06241 -80.44229
2 735253 171241999 Area 12-06 35.54452 -78.75388
3 4325564 85564887 Area 12-04 35.49328 -78.73756
4 4222241 85461255 Area 12-06 35.53621 -78.75553
5 11997754 356053648 Area 12-04 35.49328 -78.73756
6 13444458 536073775 Area 12-06 35.53987 -78.74922
I'd like to run the function for each areaname. I tried calling split but but the distance function won't call on a list.
splitfile <- split(ncbaby, ncbaby$precinctname)
c <- gDistance(splitfile, byid=TRUE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘is.projected’ for signature ‘"list"’
The closest I've gotten is:
library(sp)
library(rgeos)
uniq <- unique(unlist(ncbaby$areaname))
for (i in 1:length(uniq)){
data_1 <- subset(ncbaby, areaname == uniq[i])
sp.mydata <- data_1
coordinates(sp.mydata) <- ~longitude+latitude
d <- gDistance(sp.mydata, byid=TRUE)
min.d <- apply(d, 1, function(x) order(x, decreasing=F)[2])
newdata <- cbind(data_1, data_1[min.d,], apply(d, 1, function(x) sort(x, decreasing=F)[2]))
colnames(newdata) <- c(colnames(data_1), 'n.ID',
'n.printid', '.Areaname', 'n.latitude', 'n.longitude',
'distance')
}
The problem here is that it ends up kicking out only the last returned value. ideas? I'd be interested/open to modifying the function as well, as it seems it might be more efficient.
Upvotes: 4
Views: 1644
Reputation: 7435
Here is a solution that adapts the solution in the linked post to separate groups by area. First, we define two functions:
library(sp)
library(rgeos)
nearest.neighbor <- function(lon,lat) {
df <- data.frame(lon,lat)
coordinates(df) <- ~lon+lat
d <- gDistance(df, byid=TRUE)
# remove the self distance from being considered and use which.min to find the nearest neighbor
d[cbind(1:nrow(d),1:nrow(d))] <- NA
min.d <- rbind(apply(d,1,function(x) {ind <- which.min(x); list(ind=ind,distance=x[ind])}))
}
order.by.ind <- function (x,ind) x[ind]
The nearest.neighbor
function follows closely to the code in the linked post except that it returns a vector of lists. Each list contains the index to the nearest neighbor and the distance to that neighbor. The key here is that we only want to compute the distance once to return both the min distance and the corresponding index. Note that we remove the self-distance from consideration by setting the diagonal of d
to NA
and then use which.min
to locate the nearest neighbor, which avoids having to do a full sort.
The order.by.ind
function simply reorders the input column x
according to the indices ind
.
With these two functions, we can use mutate
from the dplyr
package to compute the desired columns grouped by areaname
:
library(dplyr)
result <- ncbaby %>% group_by(areaname) %>%
mutate(min.d=nearest.neighbor(longitude, latitude)) %>%
mutate_each(vars=c(id, printid, latitude, longitude),
funs(order.by.ind, "order.by.ind", order.by.ind(.,ind=unlist(min.d)[c(TRUE,FALSE)]))) %>%
mutate(distance=unlist(min.d)[c(FALSE,TRUE)]) %>%
mutate(.Areaname=areaname) %>%
select(-min.d)
newvars <- c('n.ID', 'n.printid', 'n.latitude', 'n.longitude', 'distance', '.Areaname')
colnames(result) <- c(colnames(ncbaby), newvars)
Notes:
mutate
creates a temporary column min.d
containing the list of ind
and distance
to the nearest neighbor. This is the nearest neighbor in the area since we are group_by
areaname
.mutate_each
creates a new column for each variable in vars
by reordering that column in accordance to the ind
. Note that we extract the ind
from min.d
by unlisting and then extracting the odd elements using [c(TRUE,FALSE)]
.mutate
creates the distance
column by extracting the distance
from min.d
. Again, this is by unlisting but then extracting the even elements using [c(FALSE,TRUE)]
.mutate
is actually not needed because the .Areaname
column will be redundant to areaname
in the result.min.d
column from the result and set the column names of the resulting data frame as desired.The result using your data is:
print(result)
##Source: local data frame [7 x 11]
##Groups: areaname [3]
##
## id printid areaname latitude longitude n.ID n.printid n.latitude n.longitude distance .Areaname
## <int> <int> <fctr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <fctr>
##1 7912048 233502729 073 36.06241 -80.44229 7912049 233502730 36.06251 -80.44329 0.001004988 073
##2 735253 171241999 Area 12-06 35.54452 -78.75388 13444458 536073775 35.53987 -78.74922 0.006583168 Area 12-06
##3 4325564 85564887 Area 12-04 35.49328 -78.73756 11997754 356053648 35.49328 -78.73756 0.000000000 Area 12-04
##4 4222241 85461255 Area 12-06 35.53621 -78.75553 13444458 536073775 35.53987 -78.74922 0.007294635 Area 12-06
##5 11997754 356053648 Area 12-04 35.49328 -78.73756 4325564 85564887 35.49328 -78.73756 0.000000000 Area 12-04
##6 13444458 536073775 Area 12-06 35.53987 -78.74922 735253 171241999 35.54452 -78.75388 0.006583168 Area 12-06
##7 7912049 233502730 073 36.06251 -80.44329 7912048 233502729 36.06241 -80.44229 0.001004988 073
where I added a new row for areaname="073"
so that each area has at least two rows.
Upvotes: 0
Reputation: 23574
I checked the linked post and revised the idea a bit. I thought using apply()
may not be a good idea for a large data set. So I rather used a data.table-related approach. First, I converted my sample data to a SpatialPointsDataFrame. Then, I split the data by a group variable (i.e.,group). As Eddie suggested, I utilized lapply()
with data.table functions. When you use gDistance()
, you have a two-dimensional vector as an output. I converted that to a data.table object so that following data process potentially goes faster. I reshaped the dt object with melt()
and remove all data points with distance = 0. Finally, I took the first row for each Var1
. Please note that Var1
here represents each row of the sample data, mydf
. The last job was to add the new distance vector to the original data frame. I hope this will help you.
DATA
group user_id latitude longitude
1 B23 85553 -34.44011 172.6954
2 B23 85553 -34.43929 172.6939
3 B23 85553 -34.43929 172.6939
4 B23 85553 -34.43851 172.6924
5 B23 57357 -34.42747 172.6778
6 B23 57357 -34.42747 172.6778
7 B23 57357 -34.42747 172.6778
8 B23 98418 -34.43119 172.7014
9 B23 98418 -34.43225 172.7023
10 B23 98418 -34.43224 172.7023
11 B23 98418 -34.43224 172.7023
12 B24 57357 -34.43647 172.7141
13 B24 57357 -34.43647 172.7141
14 B24 57357 -34.43647 172.7141
15 B24 98418 -34.43904 172.7172
16 B24 98418 -34.43904 172.7172
17 B24 98418 -34.43904 172.7172
18 B24 98418 -34.43925 172.7168
19 B24 98418 -34.43915 172.7169
20 B24 98418 -34.43915 172.7169
21 B24 98418 -34.43915 172.7169
22 B24 98418 -34.43915 172.7169
CODE
library(sp)
library(rgeos)
library(data.table)
# Copy the original
temp <- mydf
#DF to SPDF
coordinates(temp) <- ~longitude+latitude
# Split the data by a group variable
mylist <- split(temp, f = temp$group)
#For each element in mylist, apply gDistance, reshape the output of
# gDistance and create a data.table. Then, reshape the data, remove
# rows with distance = 0. Finally, choose the first row for each
# variable. levels in variable represents rows in mydf.
out <- rbindlist(
lapply(mylist, function(x){
d <- setDT(melt(gDistance(x, byid = TRUE)))
setorder(d, Var1, value)
d <- d[value > 0]
d <- d[, .SD[1], by = Var1]
d
})
)
out <- cbind(mydf, distance = out$value)
# group user_id latitude longitude distance
#1 B23 85553 -34.44011 172.6954 1.743945e-03
#2 B23 85553 -34.43929 172.6939 1.661118e-03
#3 B23 85553 -34.43929 172.6939 1.661118e-03
#4 B23 85553 -34.43851 172.6924 1.661118e-03
#5 B23 57357 -34.42747 172.6778 1.836642e-02
#6 B23 57357 -34.42747 172.6778 1.836642e-02
#7 B23 57357 -34.42747 172.6778 1.836642e-02
#8 B23 98418 -34.43119 172.7014 1.369100e-03
#9 B23 98418 -34.43225 172.7023 1.456022e-05
#10 B23 98418 -34.43224 172.7023 1.456022e-05
#11 B23 98418 -34.43224 172.7023 1.456022e-05
#12 B24 57357 -34.43647 172.7141 3.862696e-03
#13 B24 57357 -34.43647 172.7141 3.862696e-03
#14 B24 57357 -34.43647 172.7141 3.862696e-03
#15 B24 98418 -34.43904 172.7172 3.245705e-04
#16 B24 98418 -34.43904 172.7172 3.245705e-04
#17 B24 98418 -34.43904 172.7172 3.245705e-04
#18 B24 98418 -34.43925 172.7168 1.393162e-04
#19 B24 98418 -34.43915 172.7169 1.393162e-04
#20 B24 98418 -34.43915 172.7169 1.393162e-04
#21 B24 98418 -34.43915 172.7169 1.393162e-04
#22 B24 98418 -34.43915 172.7169 1.393162e-04
DATA in dput()
mydf <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("B23",
"B24"), class = "factor"), user_id = c(85553L, 85553L, 85553L,
85553L, 57357L, 57357L, 57357L, 98418L, 98418L, 98418L, 98418L,
57357L, 57357L, 57357L, 98418L, 98418L, 98418L, 98418L, 98418L,
98418L, 98418L, 98418L), latitude = c(-34.440114, -34.43929,
-34.43929, -34.438507, -34.427467, -34.427467, -34.427467, -34.431187,
-34.432254, -34.43224, -34.43224, -34.436472, -34.436472, -34.436472,
-34.439038, -34.439038, -34.439038, -34.439246, -34.439149, -34.439149,
-34.439149, -34.439149), longitude = c(172.695443, 172.693906,
172.693906, 172.692441, 172.677763, 172.677763, 172.677763, 172.701413,
172.702284, 172.702288, 172.702288, 172.71411, 172.71411, 172.71411,
172.717203, 172.717203, 172.717203, 172.716798, 172.716898, 172.716898,
172.716898, 172.716898)), .Names = c("group", "user_id", "latitude",
"longitude"), row.names = c(NA, -22L), class = "data.frame")
Upvotes: 3