estebangatillo

Reputation: 63

Average points together without repeating and reduce final dataframe

The goal is to average points together within 10 meters without using any point in more than one average, reduce the point dataframe to the averaged points, and ideally obtain a smooth flow of points along the routes on which the points were collected. Here is an 11-point subset example dataframe from a much larger file (25,000 observations):

library(sf)
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)), 
                 datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
                 lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
                 lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
                 ) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,                         
               coords = c("lon", "lat"),
               crs = projcrs)

Here is what I've tried:

  1. Many iterations of st_is_within_distance(df, df, tolerance), including an aggregate method shown here. These don't work because the same points get averaged multiple times.
  2. Got close with filter and across by trying to dynamically update a list inside lapply, but it didn't work out in the end.
  3. This answer from @jeffreyevans is helpful, but it doesn't really solve the problem and is a bit outdated.
  4. The spThin package doesn't work because it's made for more specific variables.
  5. I thought to cluster using this post, but the clusters return seemingly random points and don't actually reduce the dataframe efficiently.

Here is as close as I've gotten. Again, the issue with this solution is that it repeats points when collecting averages, which gives more weight to certain points than to others.

  library(dplyr) # for %>%, summarise(), first(), bind_rows()

  # first set tolerance
  tolerance <- 20 # 20 meters

  # get distance between points
  i <- st_is_within_distance(df, df, tolerance)

  # filter for indices with more than 1 (self) neighbor
  i <- i[which(lengths(i) > 1)]

  # filter for unique indices (point 1, 2 / point 2, 1)
  i <- i[!duplicated(i)]

  # points in `sf` object that have no neighbors within tolerance
  no_neighbors <- df[!(1:nrow(df) %in% unlist(i)), ]

  # iterate over indices of neighboring points
  avg_points <- lapply(i, function(b){
    df <- df[unlist(b), ]
    coords <- st_coordinates(df)

    df <- df %>%
      st_drop_geometry() %>%
      cbind(., coords)

    df_sum <- df %>%
      summarise(
        datetime = first(datetime),
        trait = mean(trait),
        X = mean(X),
        Y = mean(Y))

    return(df_sum)
  }) %>%
    bind_rows() %>%
    st_as_sf(coords = c('X', 'Y'),
             crs = "+proj=longlat +datum=WGS84 +no_defs ")

Upvotes: 4

Views: 245

Answers (4)

jblood94

Reputation: 17001

If the goal is not to weight any point more heavily than any other in the cluster averages, it would be more balanced to use weighted averages rather than to force each cluster to contain a set of points disjoint from all other clusters.

One way to think of the below methodology is to "chop up" each observation and divvy up the pieces into clusters in such a way that the weight of the pieces in each cluster sums to 1.

This will probably be too expensive for 25k observations, so one option could be to perform it on overlapping or non-overlapping segments and stitch them together (a rough sketch of this follows the example below).

library(sf)
library(Rfast) # for the 'eachrow' function

df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)), 
                 datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
                 lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
                 lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,                         
               coords = c("lon", "lat"),
               crs = projcrs)
n <- nrow(df)
# sum the trait column for a sanity check after calculations
sumtrait <- sum(df$trait)

# first set tolerance
tolerance <- 20 # 20 meters
tol <- 1e-5 # tolerance for the weight matrix marginal sums
# create clusters of points grouped by circles centered at each point
i <- st_is_within_distance(df, df, tolerance)
# Initialize a matrix for the weight of each point within each cluster. The
# initial value represents an unweighted average for each cluster, so the row
# sums are not necessarily 1.
sz <- lengths(i)
w <- replace(matrix(0, n, n), unlist(sapply(1:n, function(x) i[[x]] + n*(x - 1))), rep.int(1/sz, sz))
# iteratively adjust the weights until the marginal sums all equal 1 (within
# tolerance)
marg <- rowSums(w)

while (max(abs(marg - 1)) > tol) {
  w <- w/marg
  marg <- colSums(w)
  w <- eachrow(w, marg, "/")
  marg <- rowSums(w)
}

df$trait <- colSums(w*df$trait)
print(df, n = nrow(df))
#> Simple feature collection with 11 features and 2 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
#> CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
#>       trait            datetime                   geometry
#> 1  91.37719 2021-08-06 15:08:43 POINT (-94.58463 39.09253)
#> 2  91.44430 2021-08-06 15:08:44 POINT (-94.58462 39.09262)
#> 3  91.31374 2021-08-06 15:08:46  POINT (-94.5846 39.09281)
#> 4  91.46755 2021-08-06 15:08:47 POINT (-94.58459 39.09291)
#> 5  91.64053 2021-08-06 15:43:17 POINT (-94.58464 39.09248)
#> 6  91.37719 2021-08-06 15:43:18 POINT (-94.58464 39.09255)
#> 7  91.44430 2021-08-06 15:43:19 POINT (-94.58464 39.09261)
#> 8  91.41380 2021-08-06 15:43:20 POINT (-94.58464 39.09266)
#> 9  91.41380 2021-08-06 15:43:21  POINT (-94.58466 39.0927)
#> 10 91.31880 2021-08-06 15:43:22  POINT (-94.5847 39.09273)
#> 11 91.31880 2021-08-06 15:43:23 POINT (-94.58476 39.09274)

# check that the sum of the "traits" column is unchanged
sum(df$trait) - sumtrait
#> [1] 4.875536e-07
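
As a rough sketch of the segmenting idea mentioned above (an assumption, not part of the original answer), the weighting procedure could be wrapped in a helper and applied to non-overlapping segments of a large dataset, then stitched back together. The helper name weight_average and the segment size of 1,000 rows are illustrative.

weight_average <- function(dat, tolerance = 20, tol = 1e-5) {
  n <- nrow(dat)
  i <- st_is_within_distance(dat, dat, tolerance)
  sz <- lengths(i)
  # initial (unweighted) cluster weights, as above
  w <- replace(matrix(0, n, n),
               unlist(sapply(1:n, function(x) i[[x]] + n*(x - 1))),
               rep.int(1/sz, sz))
  marg <- rowSums(w)
  # iterate until each observation's total weight across clusters is 1
  while (max(abs(marg - 1)) > tol) {
    w <- w/marg
    marg <- colSums(w)
    w <- eachrow(w, marg, "/")
    marg <- rowSums(w)
  }
  dat$trait <- colSums(w*dat$trait)
  dat
}

# split the (time-ordered) data into segments of ~1,000 rows, average each
# segment, then stitch the results back together
segments <- split(df, ceiling(seq_len(nrow(df)) / 1000))
df_avg <- do.call(rbind, lapply(segments, weight_average))

Note that with non-overlapping segments, points near a segment boundary are not averaged with neighbors in the adjacent segment; overlapping segments would address that at additional cost.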

UPDATE: If an exclusive grouping method is really needed, this implements a greedy algorithm:

avg_points <- numeric(nrow(df))
clusters <- vector("list", nrow(df))
currclust <- 0L
df$unused <- TRUE
for (cl in seq_along(i)) {
  if (any(df$unused[i[[cl]]])) {
    currclust <- currclust + 1L
    avg_points[currclust] <- mean(df$trait[i[[cl]]][df$unused[i[[cl]]]])
    clusters[[currclust]] <- i[[cl]][df$unused[i[[cl]]]]
    df$unused[i[[cl]]] <- FALSE
  }
}

avg_points <- avg_points[1:currclust]
clusters <- clusters[1:currclust]
avg_points
#> [1] 91.34714 91.65000 91.40000
clusters
#> [[1]]
#> [1] 1 2 5 6 7 8 9
#> 
#> [[2]]
#> [1] 10 11
#> 
#> [[3]]
#> [1] 3 4

Note that the issue of uneven weightings is still present: the observations in group 1 are each weighted 1/7, while the observations in groups 2 and 3 are each weighted 1/2.
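
A minimal sketch (an assumption, not part of the answer) of reducing the data to one sf point per greedy cluster, averaging the member coordinates along with the trait:

# one averaged point per cluster, using `clusters` and `avg_points` from above
reduced <- st_as_sf(
  do.call(rbind, lapply(seq_along(clusters), function(k) {
    xy <- colMeans(st_coordinates(df[clusters[[k]], ]))
    data.frame(trait = avg_points[k], X = xy[["X"]], Y = xy[["Y"]])
  })),
  coords = c("X", "Y"), crs = st_crs(df)
)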

Upvotes: 0

thothal

Reputation: 20409

If I understand your problem correctly, it all boils down to selecting the "right" neighbors, i.e. those within a certain neighborhood that have not been used yet. If there is no such neighbor, simply use the point itself (even if it was already used in the averaging for another point).

Here's a solution using purrr::accumulate to first produce the correct indices and then simply use these indices to do the averaging:

library(purrr)
library(dplyr)

# `i` is assumed to be the full (unfiltered) neighbor list, i.e.
# i <- st_is_within_distance(df, df, tolerance)
idx <- accumulate(i[-1L], function(x, y) {
   x$points <- setdiff(y, x$used)
   x$used <- union(x$used, y)
   x
}, .init = list(used = i[[1L]], points = i[[1L]]))

idx[1:4]

# [[1]]
# [[1]]$used
# [1] 1 2 5 6 7 8 9
# 
# [[1]]$points
# [1] 1 2 5 6 7 8 9
#
# 
# [[2]]
# [[2]]$used
# [1]  1  2  5  6  7  8  9 10 11
# 
# [[2]]$points
# [1] 10 11
#
#
# [[3]]
# [[3]]$used
# [1]  1  2  5  6  7  8  9 10 11  3  4
# 
# [[3]]$points
# [1] 3 4
#
#
# [[4]]
# [[4]]$used
# [1]  1  2  5  6  7  8  9 10 11  3  4
#
# [[4]]$points
# integer(0)

The idea is that we maintain a list of used indices, i.e. the ones that were already used in any of the neighborhoods, and the remainders (points). For instance, for the first point we use the points at indices 1, 2, 5, 6, 7, 8, 9, which leaves only indices 10, 11 for the second point. If there is no point left, we return integer(0).

Now that we have set up the indices list, the rest is easy: loop through the list, select the indicated points (using the point itself in case there is no point left) and do the averaging:

idx %>% 
   imap_dfr(function(x, y) {
      if (!length(x$points)) {
         idx <- y
      } else {
         idx <- x$points
      }
      df[idx, , drop = FALSE] %>% 
         bind_cols(st_coordinates(.) %>% as_tibble()) %>% 
         st_drop_geometry() %>% 
         summarise(datetime = first(datetime),
                   trait = mean(trait),
                   X = mean(X),
                   Y = mean(Y))
   }) %>% 
   st_as_sf(coords = c('X', 'Y'),
            crs = "+proj=longlat +datum=WGS84 +no_defs ") 

# Simple feature collection with 11 features and 2 fields
# Geometry type: POINT
# Dimension:     XY
# Bounding box:  xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
# CRS:           +proj=longlat +datum=WGS84 +no_defs 
# First 10 features:
#               datetime    trait                   geometry
# 1  2021-08-06 15:08:43 91.34714 POINT (-94.58464 39.09259)
# 2  2021-08-06 15:43:22 91.65000 POINT (-94.58473 39.09274)
# 3  2021-08-06 15:08:46 91.40000  POINT (-94.5846 39.09286)
# 4  2021-08-06 15:08:47 91.58000 POINT (-94.58459 39.09291)
# 5  2021-08-06 15:43:17 91.47000 POINT (-94.58464 39.09248)
# 6  2021-08-06 15:43:18 92.19000 POINT (-94.58464 39.09255)
# 7  2021-08-06 15:43:19 92.19000 POINT (-94.58464 39.09261)
# 8  2021-08-06 15:43:20 90.57000 POINT (-94.58464 39.09266)
# 9  2021-08-06 15:43:21 90.57000  POINT (-94.58466 39.0927)
# 10 2021-08-06 15:43:22 91.65000  POINT (-94.5847 39.09273)

Upvotes: 0

mrhellmann

Reputation: 5559

Another answer, using sf::aggregate() and a hexagonal grid to find points that are within a particular distance of each other. A square grid could be used as well (a short sketch of that appears after the main example below). Results will vary somewhat depending on where exactly the grid falls in relation to the points, but no point should be used more than once in determining the mean.

Outline of the steps:

  • load data, transform to crs 5070 for measurements in meters
  • get a bounding box of the data
  • make a grid of hexagons of the bounding box of ~10m diameter each
  • aggregate points falling in the same hexagon using mean
  • join to original data

library(sf)
library(tidyverse)

set.seed(22) # might be needed to get same hex grid?

#### your sample data
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)), 
                 datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
                 lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
                 lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,                         
               coords = c("lon", "lat"),
               crs = projcrs) %>%
  st_transform(5070)  ### transform to 5070 for a projection in meters
#### end sample data


# Get a bounding box as an sf object to make a grid
bbox <- st_bbox(df) %>% st_as_sfc()

# Make a grid as hexagons with approximately the right size 
#  area ~86 m²; side ~5.77 m; long diagonal ~11.5 m
hex_grid <- st_make_grid(bbox, cellsize = 10, square = F) %>% st_as_sf()

# Aggregate mean of the hexagonal grid
hex_agg <- aggregate(df,
                     hex_grid,
                     mean,
                     join = st_contains) %>% filter(!is.na(trait))

# Assign the mean of the hexagon to points that fall 
#  within each hexagon
df_agg <- st_join(df, hex_agg)

head(df_agg) # trait.x from df, trait.y from the mean by hexagon
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 121281.6 ymin: 1786179 xmax: 121285.3 ymax: 1786227
#> Projected CRS: NAD83 / Conus Albers
#>   trait.x          datetime.x  trait.y          datetime.y
#> 1   91.22 2021-08-06 15:08:43 91.70500 2021-08-06 15:26:00
#> 2   91.22 2021-08-06 15:08:44 91.32667 2021-08-06 15:31:47
#> 3   91.22 2021-08-06 15:08:46 91.22000 2021-08-06 15:08:46
#> 4   91.58 2021-08-06 15:08:47 91.58000 2021-08-06 15:08:47
#> 5   91.47 2021-08-06 15:43:17 91.47000 2021-08-06 15:43:17
#> 6   92.19 2021-08-06 15:43:18 91.70500 2021-08-06 15:26:00
#>                   geometry
#> 1 POINT (121282.5 1786184)
#> 2 POINT (121283.2 1786194)
#> 3 POINT (121284.6 1786216)
#> 4 POINT (121285.3 1786227)
#> 5 POINT (121281.7 1786179)
#> 6 POINT (121281.6 1786186)

sum(df_agg$trait.x) - sum(df_agg$trait.y) # original trait - aggregate trait should be 0, or near 0
#> [1] 0
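
As mentioned above, a square grid can be used instead of hexagons; here is a minimal sketch (an assumption, not part of the original answer):

# same aggregation with ~10 m square cells instead of hexagons
sq_grid <- st_make_grid(bbox, cellsize = 10, square = TRUE) %>% st_as_sf()
sq_agg <- aggregate(df,
                    sq_grid,
                    mean,
                    join = st_contains) %>% filter(!is.na(trait))
df_agg_sq <- st_join(df, sq_agg)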

ggplot(df_agg) + 
  geom_sf(aes(size = trait.x), alpha = .2, color = 'blue') +  # Original trait
  geom_sf(aes(size = trait.y), alpha = .2, color = 'red') +   # New aggregated trait
  theme_void()

Sized by trait. Blue points are original, red is the new spatial mean.


## Plot of
# original points & hex grid used:
ggplot() + 
  geom_sf(data = df, color = 'red') + 
  geom_sf(data = hex_grid, fill = NA) +
  theme_void()

Plot showing the grouping of the points for the mean. It looks like there were groups of 1, 2, and 3 points per hexagon.

Created on 2022-03-23 by the reprex package (v2.0.1)

Edit

Updated to have only one point per hexagon, losing some of the original points

## Edit for one point per hexagon:
hex_agg <- aggregate(df,
                     hex_grid,
                     mean,
                     join = st_contains) %>% filter(!is.na(trait)) %>%
  rownames_to_column('hex_num')  # add hexagon number to group_by

## Guide to join on, has only hexagon number & centroid of contained points
hex_guide <- df_agg %>%
  group_by(hex_num) %>%
  summarise() %>%
  st_centroid()

# The full sf object with only one point per hexagon
#  this join isn't the most efficient, but slice(1) removes
#  the duplicate data. You could clean df_agg before the join
#  to resolve this
final_join <- df_agg %>%
                 st_drop_geometry() %>%
                 left_join(hex_guide, by = 'hex_num') %>% 
                 group_by(hex_num) %>%
                 slice(1) %>%
                 ungroup() %>%
                 st_as_sf()

ggplot() +
  geom_sf(data = final_join, color = 'red', size = 3) +
  geom_sf(data = df, color = 'black', alpha = .5) +
  geom_sf(data = hex_grid, color = 'blue', fill = NA)

The plot shows the hexagons, original data points in grey, and new red points at the centroid of grouped original points. Only 1 red point per hexagon.


Upvotes: 2

Wimpel

Reputation: 27792

I'm not sure, but perhaps this is what you are looking for?

You can experiment with the different settings/methods of smoothr::smooth() to get the desired results.

library(tidyverse)
library(igraph)
library(smoothr)
library(mapview)  # for viewing purposes only
# get a matrix of points < 10 meters apart
m <- st_is_within_distance(df, dist = 10, sparse = FALSE)
# create an igraph from the adjacency matrix
g <- graph.adjacency(m, mode = "undirected", diag = FALSE)
plot(g)

Plot of the graph: points within 10 metres of each other are connected.

# pass cluster-number to df object
df$id <- as.vector(components(g)$membership)
# create polylines (only if more than 1 point!)
df.lines <- df %>%
  group_by(id) %>%
  dplyr::add_tally() %>%
  dplyr::filter(n > 1) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("LINESTRING") %>%
  # create smooth lines
  smoothr::smooth(method = "ksmooth")
#view points and lines
mapview::mapview(list(df, df.lines))
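
Since the result depends on the smoothing settings, here is a minimal sketch (an assumption, not part of the answer) comparing a few smoothr::smooth() methods on the unsmoothed lines:

# rebuild the lines without the final smoothing step
raw_lines <- df %>%
  group_by(id) %>%
  dplyr::add_tally() %>%
  dplyr::filter(n > 1) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("LINESTRING")

# compare smoothing methods / settings
mapview::mapview(list(
  smoothr::smooth(raw_lines, method = "chaikin"),
  smoothr::smooth(raw_lines, method = "spline"),
  smoothr::smooth(raw_lines, method = "ksmooth", smoothness = 3)
))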


Upvotes: 0
