Reputation: 1875
I have a database (netcdf) of eddies moving through the ocean, and a number of tracks of fish doing the same. I'm trying to append eddy info to the fish track when the fish is within range of the eddy (locations within a (varying) threshold distance and dates identical). Data structure, complications, and reproducible example follow. I suspect this isn't super hard, I just keep tying myself in knots trying to work out a neat solution. Thanks in advance for any thoughts!
Data structure / reproducible example: 3 same-days of fish & eddy track
library(lubridate)
fish <- data.frame(lat = c(42.1, 42.6, 43.2), lon = c(-10, -10.1, -10.2), date = ymd(c("1990-01-01", "1990-01-02", "1990-01-02")))
eddy <- data.frame(lat = c(44, 42.3, 40), lon = c(-15, -10.1, -6), date = ymd(c("1990-01-01", "1990-01-02", "1990-01-02")), track = c(1,1,1), radius = c(81,82,83), vals = c(11,12,13))
Here's where it gets a bit trickier:
The size of the eddy (from which to calculate overlap or distance) is given by the radius column, so:
sf_eddy <- st_as_sf(eddy)
st_buffer(sf_eddy, dist = radius)
For st_buffer "All operations work on a per-feature basis" so I'm not sure if that'll treat sf_eddy as a single feature (probably (multi)linestring) or nrows of separate points. I imagine I can play around with that until I get it to work, such that each eddy-day (n=151937) is an sf buffer circle which I can presumably append its vals
to, if they're not included already.
There's a fixed error distance for fish positions of +-1.9°lat, +-0.77°lon. So conceptually I want to create a tall thin rectangle polygon around each day's fish location. st_buffer
doesn't look to allow this but there may be other options, potentially by creating a polygon around each centroid as per the square
example here.
I could potentially slash the workload by only running the above steps when eddy$date==fish$date? Though the eddy calculation would be a one-off and I can save the output. I suspect the fish box calculation won't be particularly taxing either. Anyway:
fish$vals <- rep(NA, nrow(fish))
sf_fish <- sf_as_sf(fish)
for (i in seq(along = sf_fish)){
if(st_intersects(sf_fish[i,],sf_eddy)) sf_fish[i,"vals"] <- sf_eddy$vals}
This is probably an awful way of doing it, obviously.
But is any of this even vaguely a reasonable way of solving this problem, or am I missing much more elegant solutions due to my amateurish knowledge of sf, sp, rgeos, etc? I have 166 fish files with average 286 days of locations = 47476 total lookups. Would it be good/fast to create an index of fish/eddy date match pairs (there may be multiple eddies on the same day) and then only test intersections on those? Since I need to append eddy$vals
(actually 2 or 3 columns) to fish
, would it be better to use spatial join, perhaps st_join
?
Edit: comment & output map oddities: 600,000 total fish/eddy row ggplot output:
Looks like I need to fix the projection as the eddies aren't printing west of 0. They're definitely in the same CRS now; maybe the originals are 0-360 not -180:180. Also while fish boxes print nicely, eddies are too small: mean r=80km=~1deg@45N, and they're definitely printing much smaller than that. I'll have a think about that too. Also: will for() have a problem if 2 eddies exist on the same day?
Edit2: originals were 0:360 which I corrected. Eddy radii still too small:
Per code discussion, original eddy file with latlons was converted with
st_as_sf
to crs4326 (wgs84) then st_transform(sf_eddy,6931)
which is North Atlantic, per image above. geometry column in resultant file is POLYGON, XY dimension, values are in projected coordinates not latlons, so e.g. sf_eddy_buffered$geometry[1]
bbox
: xmin -4894790 ymin -2418555 xmax -4894648 ymax -2418413
. speed_radius
is 71 so the difference between xmin & xmax is 142 which is 71*2 which is right. Does radius unit not equate to projection distance unit? Stewart's examples were 81000. Eddy radius unit: km. Projection / geometry unit: https://epsg.io/6931 says units are indeed metre. Multiply radius by 1000 & try again:
So that's all good. Just tried the updated
st_join
code, think it could use some work - takes absolutely forever probably on account of the 3639209 eddy rows vs 461 fish rows. Also it has created multiple rows for multiple eddies for the same fish on the same day. Potentially a function of the relatively large overlap opportunity given the error buffer box around the fish.
I suspect I'll be able to fix this by marking duplicates and deleting the duplicates which are furthest away. The current way of doing (via v and rbind) it is better than the previous way (direct cell assignment of a value) IMO since otherwise multiple values would be sent to a single cell, either crashing it or silently overwriting.
For speeding up the code I was thinking that I could subset the eddy file to only the overlap in date range of the current fish, and use that subset in the join loop. The join loop already subsets the fish and the eddy files by date anyway so maybe this is redundant, or maybe reducing the size of the eddy data that needs to be worked on will give the CPU/RAM a reduced workload each time?
Edit3: this DRASTICALLY speeds things up. Adding a simple loop counter for thisDate
showed it completed at 275 of 461 rows. And produced a 481 length file. Possibly there are lots of days where the fish wasn't in an eddy (thisDate
loop results in no v
and no rbind
), and lots of days of duplicates where the fish was in 2+ eddies (thisDate
loop results in multirow v
and rbind
). Still feels like it should process 461 loops, however...
Edit4: a small tweak:
v <- st_join(thisFishRow, thisEddyRow, left = F, largest = TRUE)
largest
means it'll only join the eddy with the largest overlap area, resulting in a max v
rowlength of 1 and thus 1 value per day max. However, nuances of the paths mean the fish can seem to go back and forth between different ongoing eddies. My suspicion (based on a similar study) is that they are more likely to stay within a strong anticyclonic eddy system. Radius (and thus area) of eddy controls overlap possibly more than position. For a 1.5 month section I found the highest-overlap eddy track number bounced between 3 eddies. Eddy track number 217008 is the most powerfully anticyclonic. Here's a (crap) plot:
As you can see, 217008 is right in the centre of the fish error box whereas the larger, more diffuse, lower powered eddies 217343 and 39625 are off to the edge. However their larger size often sees them bumped to the top because they have more area to give, and centroid proximity isn't counted.
So: if fishbox overlaps with eddy on same day then include eddy in shortlist
(
thisFishRow
& thisEddyRow
stay the same). Then: Choose eddy from shortlist based on lowest distance from centroids (something instead of st_join
). To be continued!
Edit5:
fishNearEddies <- NULL
for (thisDate in sf_df_nona$date) {
thisFishRow <- sf_df_nona[sf_df_nona$date == thisDate, ]
thisEddyRow <- sebDateSub[sebDateSub$date == thisDate, ]
overlapToday <- st_join(thisFishRow, thisEddyRow, left = F)
if (nrow(overlapToday) > 0) {
overlapEddies <- thisEddyRow[thisEddyRow$track %in% overlapToday$track,]
st_distance(x = thisFishRow, y = overlapEddies, by_element = TRUE) #0 0 due to overlap min dist = 0
} #close if
} #close for
st_distance won't work because the spatial features (buffer box & buffer circle) overlap, so min distance = 0. I ALSO need centroids to test against, I presume?
Edit6: Final Edit with working code, & answer award to Stewart for all his help. Thanks again sir.
# overlap join loop####
fishNearEddies <- NULL
EdCentroidDist <- NULL
counter <- 1
ofhowmany <- length(sf_df_nona$date)
for (thisDate in sf_df_nona$date) {
thisFishRow <- sf_df_nona[sf_df_nona$date == thisDate, ] #will be 1 row per day
thisEddyRow <- sebDateSub[sebDateSub$date == thisDate, ] #all eddies that day, could be multi rows
overlapToday <- st_join(thisFishRow, thisEddyRow, left = F) #will join only if they spatial intersect and (already) time match
if (nrow(overlapToday) > 0) {
# now need to join based on closest centroid and NOT on highest overlap
overlapEddies <- thisEddyRow[thisEddyRow$track %in% overlapToday$track,] #subset TER by overlap tracks
fishcentroid <- st_centroid(thisFishRow)
eddycentroid <- st_centroid(overlapEddies)
fishEdDists <- st_distance(x = fishcentroid, y = eddycentroid, by_element = TRUE) #result vector corresponding to overlapEddies rownumbers
fishNearEddies <- rbind(fishNearEddies, overlapEddies[which.min(fishEdDists),]) #row index for overlapEddies, no index but has date
EdCentroidDist <- rbind(EdCentroidDist, as.numeric(min(fishEdDists)))
} #close if
print(paste0(counter, " of ", ofhowmany, " fish days"))
counter <- counter + 1
} #close for
if (!is.null(fishNearEddies)) { #if there are overlaps, do processing. If not it'll fail
fishNearEddies %<>% as.data.frame() %>% # convert to nonspatial so I can remove buffer column
select(-geometry) %>% # remove geometry column. Attributes still remain. Whatever?
cbind(EdCentroidDist) #add to FNE as column
Upvotes: 2
Views: 744
Reputation: 2142
Here's something that should get you going:
library(lubridate)
library(sf)
library(ggplot2)
# sample data
fish <- data.frame(
lat = c(41.1, 43.6, 44.2),
lon = c(-7, -11, -15),
date = ymd(c("1990-01-01", "1990-01-02", "1990-01-03"))
)
# Add a blank column for the eddy values we want
fish$vals <- rep(NA, nrow(fish))
eddy <- data.frame(
lat = c(44, 42.3, 40),
lon = c(-6, -10.1, -15),
date = ymd(c("1990-01-01", "1990-01-02", "1990-01-03")),
track = c(1,1,1),
radius = c(81000,82000,83000),
vals = c(11,12,13)
)
# Convert eddy to simple features, using WGS84 CRS
sf_eddy <- st_as_sf(eddy, coords = c("lon", "lat"), crs=4326)
# Transform from geographical to projected so we can buffer correctly. You might need to pick a different CRS
sf_eddy <- st_transform(sf_eddy, 3035) # ETRS89 / LAEA Europe
# Buffer the eddy points based on their radii
sf_eddy_buffered <- st_buffer(sf_eddy, dist = sf_eddy$radius)
# Add error to fish position. There's probably a better way to do this.
fishErrLat <- 1.9
fishErrLon <- 0.77
fish$buffer <- paste('POLYGON((',
fish$lon - fishErrLon, ' ', fish$lat + fishErrLat, ',',
fish$lon + fishErrLon, ' ', fish$lat + fishErrLat, ',',
fish$lon + fishErrLon, ' ', fish$lat - fishErrLat, ',',
fish$lon - fishErrLon, ' ', fish$lat - fishErrLat, ',',
fish$lon - fishErrLon, ' ', fish$lat + fishErrLat,
'))',
sep=''
)
# Convert fish to simple features, using WGS84 CRS
sf_fish <- st_as_sf(fish, wkt='buffer', crs=4326)
# Transform from geographical to projected
sf_fish <- st_transform(sf_fish, 3035)
#Plot what we've got so far
g <- ggplot() +
geom_sf(data=sf_eddy_buffered, aes(fill=date)) +
geom_sf(data=sf_fish, aes(fill=date))
print(g)
# Check for overlap
fishNearEddies <- NULL
for (thisDate in unique(sf_fish$date)) {
thisFishRow <- sf_fish[sf_fish$date==thisDate, ]
thisEddyRow <- sf_eddy_buffered[sf_eddy_buffered$date==thisDate, ]
v <- st_join(thisFishRow, thisEddyRow, left=F)
if (nrow(v) > 0) {
fishNearEddies <- rbind(fishNearEddies, v)
}
}
And check the results:
> fishNearEddies
Simple feature collection with 1 feature and 7 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 2588691 ymin: 2402830 xmax: 2703342 ymax: 2624501
epsg (SRID): 3035
proj4string: +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
lat lon date.x date.y track radius vals buffer
2 43.6 -11 1990-01-02 1990-01-02 1 82000 12 POLYGON ((2644792 2624501, ...
This will give you only those fish that intersect with an eddy.
Upvotes: 3