dez93_2000
dez93_2000

Reputation: 1875

R: Append data when points overlap/within distance; add buffer rectangle to set1, add radius to set2

I have a database (netcdf) of eddies moving through the ocean, and a number of tracks of fish doing the same. I'm trying to append eddy info to the fish track when the fish is within range of the eddy (locations within a (varying) threshold distance and dates identical). Data structure, complications, and reproducible example follow. I suspect this isn't super hard, I just keep tying myself in knots trying to work out a neat solution. Thanks in advance for any thoughts! fish&eddyboxes

Data structure / reproducible example: 3 same-days of fish & eddy track

library(lubridate)
fish <- data.frame(lat = c(42.1, 42.6, 43.2), lon = c(-10, -10.1, -10.2), date = ymd(c("1990-01-01", "1990-01-02", "1990-01-02")))
eddy <- data.frame(lat = c(44, 42.3, 40), lon = c(-15, -10.1, -6), date = ymd(c("1990-01-01", "1990-01-02", "1990-01-02")), track = c(1,1,1), radius = c(81,82,83), vals = c(11,12,13))

Here's where it gets a bit trickier:

  1. Varying radius circular eddy buffer

The size of the eddy (from which to calculate overlap or distance) is given by the radius column, so:

sf_eddy <- st_as_sf(eddy)
st_buffer(sf_eddy, dist = radius)

For st_buffer "All operations work on a per-feature basis" so I'm not sure if that'll treat sf_eddy as a single feature (probably (multi)linestring) or nrows of separate points. I imagine I can play around with that until I get it to work, such that each eddy-day (n=151937) is an sf buffer circle which I can presumably append its vals to, if they're not included already.

  1. Set-distance latlong rectangle fish buffer

There's a fixed error distance for fish positions of +-1.9°lat, +-0.77°lon. So conceptually I want to create a tall thin rectangle polygon around each day's fish location. st_buffer doesn't look to allow this but there may be other options, potentially by creating a polygon around each centroid as per the square example here.

  1. If fish & any eddy overlap in space & time, append eddy vals to fish

I could potentially slash the workload by only running the above steps when eddy$date==fish$date? Though the eddy calculation would be a one-off and I can save the output. I suspect the fish box calculation won't be particularly taxing either. Anyway:

fish$vals <- rep(NA, nrow(fish))    
sf_fish <- sf_as_sf(fish)
for (i in seq(along = sf_fish)){
if(st_intersects(sf_fish[i,],sf_eddy)) sf_fish[i,"vals"] <- sf_eddy$vals}

This is probably an awful way of doing it, obviously.

But is any of this even vaguely a reasonable way of solving this problem, or am I missing much more elegant solutions due to my amateurish knowledge of sf, sp, rgeos, etc? I have 166 fish files with average 286 days of locations = 47476 total lookups. Would it be good/fast to create an index of fish/eddy date match pairs (there may be multiple eddies on the same day) and then only test intersections on those? Since I need to append eddy$vals (actually 2 or 3 columns) to fish, would it be better to use spatial join, perhaps st_join?

Edit: comment & output map oddities: 600,000 total fish/eddy row ggplot output:

North atlantic fish & eddy tracks

Looks like I need to fix the projection as the eddies aren't printing west of 0. They're definitely in the same CRS now; maybe the originals are 0-360 not -180:180. Also while fish boxes print nicely, eddies are too small: mean r=80km=~1deg@45N, and they're definitely printing much smaller than that. I'll have a think about that too. Also: will for() have a problem if 2 eddies exist on the same day?

Edit2: originals were 0:360 which I corrected. Eddy radii still too small: small radii eddies Per code discussion, original eddy file with latlons was converted with st_as_sf to crs4326 (wgs84) then st_transform(sf_eddy,6931) which is North Atlantic, per image above. geometry column in resultant file is POLYGON, XY dimension, values are in projected coordinates not latlons, so e.g. sf_eddy_buffered$geometry[1] bbox: xmin -4894790 ymin -2418555 xmax -4894648 ymax -2418413. speed_radius is 71 so the difference between xmin & xmax is 142 which is 71*2 which is right. Does radius unit not equate to projection distance unit? Stewart's examples were 81000. Eddy radius unit: km. Projection / geometry unit: https://epsg.io/6931 says units are indeed metre. Multiply radius by 1000 & try again: big beautiful eddies So that's all good. Just tried the updated st_join code, think it could use some work - takes absolutely forever probably on account of the 3639209 eddy rows vs 461 fish rows. Also it has created multiple rows for multiple eddies for the same fish on the same day. Potentially a function of the relatively large overlap opportunity given the error buffer box around the fish. I suspect I'll be able to fix this by marking duplicates and deleting the duplicates which are furthest away. The current way of doing (via v and rbind) it is better than the previous way (direct cell assignment of a value) IMO since otherwise multiple values would be sent to a single cell, either crashing it or silently overwriting.

For speeding up the code I was thinking that I could subset the eddy file to only the overlap in date range of the current fish, and use that subset in the join loop. The join loop already subsets the fish and the eddy files by date anyway so maybe this is redundant, or maybe reducing the size of the eddy data that needs to be worked on will give the CPU/RAM a reduced workload each time?

Edit3: this DRASTICALLY speeds things up. Adding a simple loop counter for thisDate showed it completed at 275 of 461 rows. And produced a 481 length file. Possibly there are lots of days where the fish wasn't in an eddy (thisDate loop results in no v and no rbind), and lots of days of duplicates where the fish was in 2+ eddies (thisDate loop results in multirow v and rbind). Still feels like it should process 461 loops, however...

Edit4: a small tweak:

v <- st_join(thisFishRow, thisEddyRow, left = F, largest = TRUE)

largest means it'll only join the eddy with the largest overlap area, resulting in a max v rowlength of 1 and thus 1 value per day max. However, nuances of the paths mean the fish can seem to go back and forth between different ongoing eddies. My suspicion (based on a similar study) is that they are more likely to stay within a strong anticyclonic eddy system. Radius (and thus area) of eddy controls overlap possibly more than position. For a 1.5 month section I found the highest-overlap eddy track number bounced between 3 eddies. Eddy track number 217008 is the most powerfully anticyclonic. Here's a (crap) plot: 1.5monthoverlaps As you can see, 217008 is right in the centre of the fish error box whereas the larger, more diffuse, lower powered eddies 217343 and 39625 are off to the edge. However their larger size often sees them bumped to the top because they have more area to give, and centroid proximity isn't counted. So: if fishbox overlaps with eddy on same day then include eddy in shortlist (thisFishRow & thisEddyRow stay the same). Then: Choose eddy from shortlist based on lowest distance from centroids (something instead of st_join). To be continued!

Edit5:

fishNearEddies <- NULL
for (thisDate in sf_df_nona$date) {
 thisFishRow <- sf_df_nona[sf_df_nona$date == thisDate, ]
 thisEddyRow <- sebDateSub[sebDateSub$date == thisDate, ]
 overlapToday <- st_join(thisFishRow, thisEddyRow, left = F)
 if (nrow(overlapToday) > 0) {
  overlapEddies <- thisEddyRow[thisEddyRow$track %in% overlapToday$track,] 
  st_distance(x = thisFishRow, y = overlapEddies, by_element = TRUE) #0 0 due to overlap min dist = 0
 } #close if
} #close for

st_distance won't work because the spatial features (buffer box & buffer circle) overlap, so min distance = 0. I ALSO need centroids to test against, I presume?

Edit6: Final Edit with working code, & answer award to Stewart for all his help. Thanks again sir.

# overlap join loop####
fishNearEddies <- NULL
EdCentroidDist <- NULL
counter <- 1
ofhowmany <- length(sf_df_nona$date)
for (thisDate in sf_df_nona$date) {
  thisFishRow <- sf_df_nona[sf_df_nona$date == thisDate, ] #will be 1 row per day
  thisEddyRow <- sebDateSub[sebDateSub$date == thisDate, ] #all eddies that day, could be multi rows
  overlapToday <- st_join(thisFishRow, thisEddyRow, left = F) #will join only if they spatial intersect and (already) time match
  if (nrow(overlapToday) > 0) {
    # now need to join based on closest centroid and NOT on highest overlap
    overlapEddies <- thisEddyRow[thisEddyRow$track %in% overlapToday$track,] #subset TER by overlap tracks
    fishcentroid <- st_centroid(thisFishRow)
    eddycentroid <- st_centroid(overlapEddies)
    fishEdDists <- st_distance(x = fishcentroid, y = eddycentroid, by_element = TRUE) #result vector corresponding to overlapEddies rownumbers
    fishNearEddies <- rbind(fishNearEddies, overlapEddies[which.min(fishEdDists),]) #row index for overlapEddies, no index but has date
    EdCentroidDist <- rbind(EdCentroidDist, as.numeric(min(fishEdDists)))
  } #close if
  print(paste0(counter, " of ", ofhowmany, " fish days"))
  counter <- counter + 1
} #close for

if (!is.null(fishNearEddies)) {  #if there are overlaps, do processing. If not it'll fail

  fishNearEddies %<>% as.data.frame() %>% # convert to nonspatial so I can remove buffer column
    select(-geometry) %>% # remove geometry column. Attributes still remain. Whatever?
    cbind(EdCentroidDist) #add to FNE as column

Upvotes: 2

Views: 744

Answers (1)

Stewart Macdonald
Stewart Macdonald

Reputation: 2142

Here's something that should get you going:

library(lubridate)
library(sf)
library(ggplot2)

# sample data
fish <- data.frame(
    lat = c(41.1, 43.6, 44.2),
    lon = c(-7, -11, -15),
    date = ymd(c("1990-01-01", "1990-01-02", "1990-01-03"))
)

# Add a blank column for the eddy values we want
fish$vals <- rep(NA, nrow(fish))

eddy <- data.frame(
    lat = c(44, 42.3, 40),
    lon = c(-6, -10.1, -15),
    date = ymd(c("1990-01-01", "1990-01-02", "1990-01-03")),
    track = c(1,1,1),
    radius = c(81000,82000,83000),
    vals = c(11,12,13)
)

# Convert eddy to simple features, using WGS84 CRS
sf_eddy <- st_as_sf(eddy,  coords = c("lon", "lat"), crs=4326)

# Transform from geographical to projected so we can buffer correctly. You might need to pick a different CRS
sf_eddy <- st_transform(sf_eddy, 3035) # ETRS89 / LAEA Europe

# Buffer the eddy points based on their radii
sf_eddy_buffered <- st_buffer(sf_eddy, dist = sf_eddy$radius)

# Add error to fish position. There's probably a better way to do this.
fishErrLat <- 1.9
fishErrLon <- 0.77
fish$buffer <- paste('POLYGON((',
                    fish$lon - fishErrLon, ' ', fish$lat + fishErrLat, ',',
                    fish$lon + fishErrLon, ' ', fish$lat + fishErrLat, ',',
                    fish$lon + fishErrLon, ' ', fish$lat - fishErrLat, ',',
                    fish$lon - fishErrLon, ' ', fish$lat - fishErrLat, ',',

                    fish$lon - fishErrLon, ' ', fish$lat + fishErrLat,
                    '))',
                    sep=''
                )

# Convert fish to simple features, using WGS84 CRS
sf_fish <- st_as_sf(fish, wkt='buffer', crs=4326)

# Transform from geographical to projected
sf_fish <- st_transform(sf_fish, 3035)


#Plot what we've got so far    
g <- ggplot() + 
        geom_sf(data=sf_eddy_buffered, aes(fill=date)) +
        geom_sf(data=sf_fish, aes(fill=date))

print(g)

GGPlot of eddies and fish

# Check for overlap
fishNearEddies <- NULL

for (thisDate in unique(sf_fish$date)) {

    thisFishRow <- sf_fish[sf_fish$date==thisDate, ]
    thisEddyRow <- sf_eddy_buffered[sf_eddy_buffered$date==thisDate, ]

    v <- st_join(thisFishRow, thisEddyRow, left=F)
    if (nrow(v) > 0) {
        fishNearEddies <- rbind(fishNearEddies, v)
    }

}

And check the results:

> fishNearEddies
Simple feature collection with 1 feature and 7 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 2588691 ymin: 2402830 xmax: 2703342 ymax: 2624501
epsg (SRID):    3035
proj4string:    +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
   lat lon     date.x     date.y track radius vals                         buffer
2 43.6 -11 1990-01-02 1990-01-02     1  82000   12 POLYGON ((2644792 2624501, ...

This will give you only those fish that intersect with an eddy.

Upvotes: 3

Related Questions