Reputation: 517
I am trying to figure out how isolated certain points are within my data set. I am using two methods to determine isolation, the distance of the closest neighbor and the number of neighboring sites within a given radius. All my coordinates are in latitude and longitude
This is what my data looks like:
pond lat long area canopy avg.depth neighbor n.lat n.long n.distance n.area n.canopy n.depth n.avg.depth radius1500
A10 41.95928 -72.14605 1500 66 60.61538462
AA006 41.96431 -72.121 250 0 57.77777778
Blacksmith 41.95508 -72.123803 361 77 71.3125
Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444444
Borrow.Pit.2 41.95571 -72.15413 0 0 37.7
Borrow.Pit.3 41.95546 -72.15375 0 0 29.22222222
Boulder 41.918223 -72.14978 1392 98 43.53333333
I want to put the name of the nearest neighboring pond in the column neighbor, its lat and long in n.lat and n.long, the distance between the two ponds in n.distance, and the area, canopy and avg.depth in each of the appropriate columns.
Second, I want to put the number of ponds within 1500m of the target pond into radius1500.
Does anyone know of a function or package that will help me calculate the distances/numbers that I want? If it's an issue, it won't be hard to enter the other data I need, but the nearest neighbor's name and distance, plus the number of ponds within 1500m is what I really need help with.
Thank you.
Upvotes: 41
Views: 57582
Reputation: 1780
I add below a solution using the spatialrisk
package. The key functions in this package are written in C++ (Rcpp), and are therefore very fast.
First, load the data:
df <- data.frame(pond = c("A10", "AA006", "Blacksmith", "Borrow.Pit.1",
"Borrow.Pit.2", "Borrow.Pit.3", "Boulder"),
lat = c(41.95928, 41.96431, 41.95508, 41.95601,
41.95571, 41.95546, 41.918223),
long = c(-72.14605, -72.121, -72.123803, -72.15419,
-72.15413, -72.15375, -72.14978),
area = c(1500, 250, 361, 0, 0, 0, 1392),
canopy = c(66, 0, 77, 0, 0, 0, 98),
avg.depth = c(60.61538462, 57.77777778, 71.3125, 41.44444444,
37.7, 29.22222222, 43.53333333))
The function spatialrisk::points_in_circle
calculates the observations within radius from a center point. Note that distances are calculated using the Haversine formula. Since each element of the output is a data frame, purrr::map_dfr
is used to row-bind them together:
ans1 <- purrr::map2_dfr(df$long, df$lat,
~spatialrisk::points_in_circle(df, .x, .y,
lon = long,
radius = 100000)[2,])
colnames(ans1) <- c("neighbor", "n.lat", "n.long", "n.area",
"n.canopy", "n.avg.depth", "distance_m")
neighbor n.lat n.long n.area n.canopy n.avg.depth distance_m
1 Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444 765.87823
2 Blacksmith 41.95508 -72.12380 361 77 71.31250 1053.35200
3 AA006 41.96431 -72.12100 250 0 57.77778 1053.35200
4 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000 33.76321
5 Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444 33.76321
6 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000 42.00128
7 Borrow.Pit.3 41.95546 -72.15375 0 0 29.22222 4158.21978
Now calculate the number of ponds within 1500m of the target pond. The function spatialrisk::concentration
sums the number of observations within a radius from center points. 1 is substracted from the number of ponds to exclude the pond itself.
df$npond <- 1
radius1500 <- spatialrisk::concentration(df, df, npond, lon_sub = long,
lon_full = long, radius = 1500,
display_progress = FALSE)$concentration - 1
Column-bind the data frames together:
cbind(df, ans1, radius1500)
pond lat long area canopy avg.depth neighbor n.lat n.long n.area n.canopy n.avg.depth distance_m radius1500
1 A10 41.95928 -72.14605 1500 66 60.61538 Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444 765.87823 3
2 AA006 41.96431 -72.12100 250 0 57.77778 Blacksmith 41.95508 -72.12380 361 77 71.31250 1053.35200 1
3 Blacksmith 41.95508 -72.12380 361 77 71.31250 AA006 41.96431 -72.12100 250 0 57.77778 1053.35200 1
4 Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000 33.76321 3
5 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000 Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444 33.76321 3
6 Borrow.Pit.3 41.95546 -72.15375 0 0 29.22222 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000 42.00128 3
7 Boulder 41.91822 -72.14978 1392 98 43.53333 Borrow.Pit.3 41.95546 -72.15375 0 0 29.22222 4158.21978 0
Upvotes: 3
Reputation: 145
Another answer which while probably slower might have an intuitive appeal for dplyr addicts.
You create a mega-grid of every single possible combination of lat/lons, then you can find the one that has the smallest distance using geosphere.
The example is where you have two datasets with different points to compare - but you can adjust it easily by duplicating the first dataset.
library(tidyverse)
library(geosphere)
library(data.table)
#This function creates a big dataframe with every possible combination
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
shortest_distance <- expand.grid.df(df1,df2) %>%
mutate(distance = distHaversine(p1 = cbind(lon_2,lat_2),
p2 = cbind(lon,lat))) %>%
group_by(ACCIDENT_NO) %>%
slice(which.min(distance))
Upvotes: 3
Reputation: 106
In Rfast, there is a function called "dista" and computes the Euclidean or the Manhattan distances only (at the moment). It gives the option to compute the k-smallest distances. Alternatively it can return the indexes of the observations with the smallest distances. The cosinus distance is basically almost the same as the Eucledean distance (exluding a constant, 2 I think).
Upvotes: 2
Reputation: 681
I add below an alternative solution using the newer sf
package, for those interested and coming to this page now (as I did).
First, load the data and create the sf
object.
# Using sf
mydata <- structure(
list(pond = c("A10", "AA006", "Blacksmith", "Borrow.Pit.1",
"Borrow.Pit.2", "Borrow.Pit.3", "Boulder"),
lat = c(41.95928, 41.96431, 41.95508, 41.95601, 41.95571, 41.95546,
41.918223),
long = c(-72.14605, -72.121, -72.123803, -72.15419, -72.15413,
-72.15375, -72.14978),
area = c(1500L, 250L, 361L, 0L, 0L, 0L, 1392L),
canopy = c(66L, 0L, 77L, 0L, 0L, 0L, 98L),
avg.depth = c(60.61538462, 57.77777778, 71.3125, 41.44444444,
37.7, 29.22222222, 43.53333333)),
class = "data.frame", row.names = c(NA, -7L))
library(sf)
data_sf <- st_as_sf(mydata, coords = c("long", "lat"),
# Change to your CRS
crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
st_is_longlat(data_sf)
sf::st_distance
calculates the distance matrix in meters using Great Circle distance when using lat/lon data.
dist.mat <- st_distance(data_sf) # Great Circle distance since in lat/lon
# Number within 1.5km: Subtract 1 to exclude the point itself
num.1500 <- apply(dist.mat, 1, function(x) {
sum(x < 1500) - 1
})
# Calculate nearest distance
nn.dist <- apply(dist.mat, 1, function(x) {
return(sort(x, partial = 2)[2])
})
# Get index for nearest distance
nn.index <- apply(dist.mat, 1, function(x) { order(x, decreasing=F)[2] })
n.data <- mydata
colnames(n.data)[1] <- "neighbor"
colnames(n.data)[2:ncol(n.data)] <-
paste0("n.", colnames(n.data)[2:ncol(n.data)])
mydata2 <- data.frame(mydata,
n.data[nn.index, ],
n.distance = nn.dist,
radius1500 = num.1500)
rownames(mydata2) <- seq(nrow(mydata2))
mydata2
pond lat long area canopy avg.depth neighbor n.lat n.long n.area n.canopy
1 A10 41.95928 -72.14605 1500 66 60.61538 Borrow.Pit.1 41.95601 -72.15419 0 0
2 AA006 41.96431 -72.12100 250 0 57.77778 Blacksmith 41.95508 -72.12380 361 77
3 Blacksmith 41.95508 -72.12380 361 77 71.31250 AA006 41.96431 -72.12100 250 0
4 Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444 Borrow.Pit.2 41.95571 -72.15413 0 0
5 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000 Borrow.Pit.1 41.95601 -72.15419 0 0
6 Borrow.Pit.3 41.95546 -72.15375 0 0 29.22222 Borrow.Pit.2 41.95571 -72.15413 0 0
7 Boulder 41.91822 -72.14978 1392 98 43.53333 Borrow.Pit.3 41.95546 -72.15375 0 0
n.avg.depth n.distance radius1500
1 41.44444 766.38426 3
2 71.31250 1051.20527 1
3 57.77778 1051.20527 1
4 37.70000 33.69099 3
5 41.44444 33.69099 3
6 37.70000 41.99576 3
7 29.22222 4149.07406 0
For obtaining the closest neighbor after calculating distance, you can use sort()
with the partial = 2
argument. Depending on the amount of data, this could be much faster than using order
as in the previous solution. The package Rfast
is likely even faster but I avoid including additional packages here. See this related post for a discussion and benchmarking of various solutions: https://stackoverflow.com/a/53144760/12265198
Upvotes: 16
Reputation: 5945
Best option is to use libraries sp
and rgeos
, which enable you to construct spatial classes and perform geoprocessing.
library(sp)
library(rgeos)
Read the data and transform them to spatial objects:
mydata <- read.delim('d:/temp/testfile.txt', header=T)
sp.mydata <- mydata
coordinates(sp.mydata) <- ~long+lat
class(sp.mydata)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
Now calculate pairwise distances between points
d <- gDistance(sp.mydata, byid=T)
Find second shortest distance (closest distance is of point to itself, therefore use second shortest)
min.d <- apply(d, 1, function(x) order(x, decreasing=F)[2])
Construct new data frame with desired variables
newdata <- cbind(mydata, mydata[min.d,], apply(d, 1, function(x) sort(x, decreasing=F)[2]))
colnames(newdata) <- c(colnames(mydata), 'neighbor', 'n.lat', 'n.long', 'n.area', 'n.canopy', 'n.avg.depth', 'distance')
newdata
pond lat long area canopy avg.depth neighbor n.lat n.long n.area n.canopy n.avg.depth
6 A10 41.95928 -72.14605 1500 66 60.61538 Borrow.Pit.3 41.95546 -72.15375 0 0 29.22222
3 AA006 41.96431 -72.12100 250 0 57.77778 Blacksmith 41.95508 -72.12380 361 77 71.31250
2 Blacksmith 41.95508 -72.12380 361 77 71.31250 AA006 41.96431 -72.12100 250 0 57.77778
5 Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000
4 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000 Borrow.Pit.1 41.95601 -72.15419 0 0 41.44444
5.1 Borrow.Pit.3 41.95546 -72.15375 0 0 29.22222 Borrow.Pit.2 41.95571 -72.15413 0 0 37.70000
6.1 Boulder 41.91822 -72.14978 1392 98 43.53333 Borrow.Pit.3 41.95546 -72.15375 0 0 29.22222
distance
6 0.0085954872
3 0.0096462277
2 0.0096462277
5 0.0003059412
4 0.0003059412
5.1 0.0004548626
6.1 0.0374480316
Edit: if coordinates are in degrees and you would like to calculate distance in kilometers, use package geosphere
library(geosphere)
d <- distm(sp.mydata)
# rest is the same
This should provide better results, if the points are scattered across the globe and coordinates are in degrees
Upvotes: 50
Reputation: 686
The Solution propose by @Zbynek is quite nice but if you are looking for a distance between two neighboor in km like I am , I am proposing this solution.
earth.dist<-function(lat1,long1,lat2,long2){
rad <- pi/180
a1 <- lat1 * rad
a2 <- long1 * rad
b1 <- lat2 * rad
b2 <- long2 * rad
dlat <- b1-a1
dlon<- b2-a2
a <- (sin(dlat/2))^2 +cos(a1)*cos(b1)*(sin(dlon/2))^2
c <- 2*atan2(sqrt(a),sqrt(1-a))
R <- 6378.145
dist <- R *c
return(dist)
}
Dist <- matrix(0,ncol=length(mydata),nrow=length(mydata.sp))
for (i in 1:length(mydata)){
for(j in 1:length(mydata.sp)){
Dist[i,j] <- earth.dist(mydata$lat[i],mydata$long[i],mydata.sp$lat[j],mydata.sp$long[j])
}}
DDD <- matrix(0, ncol=5,nrow=ncol(Dist)) ### RECTIFY the nb of col by the number of variable you want
for(i in 1:ncol(Dist)){
sub<- sort(Dist[,i])[2]
DDD[i,1] <- names(sub)
DDD[i,2] <- sub
DDD[i,3] <- rownames(Dist)[i]
sub_neig_atr <- Coord[Coord$ID==names(sub),]
DDD[i,4] <- sub_neig_atr$area
DDD[i,5] <- sub_neig_atr$canopy
### Your can add any variable you want here
}
DDD <- as.data.frame(DDD)
names(DDD)<-c("neigboor_ID","distance","pond","n.area","n.canopy")
data <- merge(mydata,DDD, by="pond")
You end up getting a distance in km if your coordinates are long and lat.
Any suggestions to make it better ?
Upvotes: 1