stats_noob
stats_noob

Reputation: 5925

All Combinations of Numbers That Meet a Condition

I am working with the R programming language.

Suppose I have numbers: 2010,2011,2012,2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020

My Question: I want to find out all possible pairs of these numbers where the difference is greater than 1. E.g. (2013-2010), (2019-2011), etc.

Currently, I am using a very clunky way to do this:

x = 2010:2020
grid = expand.grid(x,x)
grid$diff = grid$Var1 - grid$Var2
grid$condition = ifelse(grid$Var1 - grid$Var2 > 1, "yes", "no")

final = grid[grid$condition == "yes",]

Is there a more efficient way to do this?

For example, here I first generate all possible combinations and then eliminate the invalid combinations. This might be ineffective when there are a large number of numbers. Is there a better way to do this?

Thanks!

Upvotes: 3

Views: 183

Answers (6)

Khlick
Khlick

Reputation: 360

In your problem, the objective is to obtain the combinations of numbers in the vector such that the difference of the pair is greater than 1, but you don't want to make unnecessary comparisons that waste processing time. While I like Ritchie Sacramento's answer for contiguous sequences, I wanted to provide an answer that can handle gaps in the sequence, unsorted sequences, and non-integer sequences. The only way I can think to make this more 'efficient' is to iterate over the sequence and gather all the numbers that are at least 1 greater than the current iterand.

In my approach, I first enforce the sequence is sorted. Then I loop over each element in the sequence and compare only the indices after the current one. If these meet the difference criteria, I add them to a running list of pairs. This assumes you don't have an repeated numbers, otherwise you might want to first run unique().

#' Find pairs in a sequence with a difference greater than a specified offset.
#'
#' This function sorts the input sequence, and then iterates through the sorted sequence 
#' to find pairs of numbers with a difference greater than the specified difference offset. 
#' Pairs are returned in a matrix where pair[1] > pair[2].
#'
#' @param x A numeric vector representing the input sequence.
#' @param differenceOffset A numeric value representing the minimum difference between pairs. Default is 1.
#' @return A matrix where each row represents a pair with a difference greater than the difference offset.
#' @examples
#' find_pairs(c(1,2,3,4,5))  # uses default differenceOffset of 1
#' find_pairs(c(2,4,6,8,10), 2)
find_pairs <- function(x, differenceOffset = 1) {
  # sort the sequence
  x_sorted <- sort(x)
  len <- length(x_sorted)

  # initialize a list to store the pairs
  pairs <- list()

  # iterate through the sorted sequence
  for(current in 1:(len - 1)) {
    # iterate and test the next index forward of the current
    for(test in (current + 1):len) {
      # if the difference is > differenceOffset, add the pair to the list
      if(x_sorted[test] - x_sorted[current] > differenceOffset) {
        # flip the order to put pair[1] > pair[2] in the output
        pairs <- c(pairs, list(c(x_sorted[test], x_sorted[current])))
      }
    }
  }

  # convert the list to a matrix
  do.call(rbind, pairs)
}

Here's a couple examples of the usage.

# unsorted example
> find_pairs(c(4,3,2,5,1))
     [,1] [,2]
[1,]    3    1
[2,]    4    1
[3,]    5    1
[4,]    4    2
[5,]    5    2
[6,]    5    3

# use difference offset of >2
> find_pairs(c(1,2,3,4,5), 2)
     [,1] [,2]
[1,]    4    1
[2,]    5    1
[3,]    5    2

# your example
> find_pairs(2010:2020)
      [,1] [,2]
 [1,] 2012 2010
 [2,] 2013 2010
 [3,] 2014 2010
 [4,] 2015 2010
 [5,] 2016 2010
 [6,] 2017 2010
 [7,] 2018 2010
 [8,] 2019 2010
 [9,] 2020 2010
[10,] 2013 2011
...
[40,] 2018 2016
[41,] 2019 2016
[42,] 2020 2016
[43,] 2019 2017
[44,] 2020 2017
[45,] 2020 2018

# gap example
> find_pairs(c(1:2, 6:8))
     [,1] [,2]
[1,]    6    1
[2,]    7    1
[3,]    8    1
[4,]    6    2
[5,]    7    2
[6,]    8    2
[7,]    8    6


# non-integer example
> find_pairs(seq(1, 3, by = 0.25), 1)
      [,1] [,2]
 [1,] 2.25 1.00
 [2,] 2.50 1.00
 [3,] 2.75 1.00
 [4,] 3.00 1.00
 [5,] 2.50 1.25
 [6,] 2.75 1.25
 [7,] 3.00 1.25
 [8,] 2.75 1.50
 [9,] 3.00 1.50
[10,] 3.00 1.75

EDIT

In terms of time complexity, the solution above might have a total time complexity [O(n log n + n)] less than your solution using expand.grid [O(n^2)] for long vectors.

jblood94's answer with benchmarking showed the above approach was not really efficient at all. In the original function, find_pairs, I attempted to spell the process out as explicitly as possible, which led to considerable computational overhead. It is often a fun challenge to write complicated one-liners that are the fast and flashy, but this is almost always at the cost of being almost entirely illegible post hoc. Of course, using a package like data.table that is designed specifically for optimizing this kind of problem would be the best solution. But, maintaining the same integrity of clear and concise code, and using base R, I submit a revision below in find_pairs.2.

In this version I follow a similar thread, but I make a few optimizations. First, I allocate a large matrix for the pairs rather than an empty list. Though this is less memory efficient, it appears to affect the timing. Second, I utilize a while-loop to find the the first index whose value exceeds the differenceOffset. Once we find this index, and because we have sorted the array, we know all values including and beyond it must be greater than differenceOffset. I then simply set the current value (through scalar expansion) alongside the the values larger than our test condition directly into the matrix into the next available positions (determined by a pair_counter). Finally, I return trimmed matrix.

find_pairs.2 <- function(x, differenceOffset = 1) {
  # sort the sequence
  x <- sort(x)
  len <- length(x)

  # initialize a matrix to store the pairs
  # assume that in the worst case, we have len*(len-1)/2 pairs
  max_pairs <- len * (len - 1) / 2
  pairs <- matrix(nrow = max_pairs, ncol = 2)

  # initialize a counter for the number of pairs found
  pair_count <- 0

  # iterate through the sorted sequence
  for (current in 1:(len - 1)) {
    # find the first index (test) that is strictly greater than differenceOffset
    test <- current + 1
    while (test <= len && x[test] - x[current] <= differenceOffset) {
      test <- test + 1
    }
    
    # if test is within bounds, add all pairs from test to len to the matrix
    if (test <= len) {
      num_new_pairs <- len - test + 1
      pairs[(pair_count + 1):(pair_count + num_new_pairs), 1] <- x[test:len]
      pairs[(pair_count + 1):(pair_count + num_new_pairs), 2] <- x[current]
      pair_count <- pair_count + num_new_pairs
    }
  }

  # trim unused slots in the matrix
  return(pairs[1:pair_count, ])
}

If you really want to speed up your calculations, you can use the Rcpp package and compile a C++ version of the new solution (benchmarked below).

find_pairs.cpp

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix find_pairs_cpp(NumericVector x, int differenceOffset = 1) {
  // Sort the sequence
  std::sort(x.begin(), x.end());

  int len = x.size();

  // Initialize a matrix to store the pairs
  // Assume that in the worst case, we have len*(len-1)/2 pairs
  int max_pairs = len * (len - 1) / 2;
  NumericMatrix pairs(max_pairs, 2);

  // Initialize a counter for the number of pairs found
  int pair_count = 0;

  // Iterate through the sorted sequence
  for (int current = 0; current < len - 1; ++current) {
    // Find the first index (test) that is strictly greater than differenceOffset
    int test = current + 1;
    while (test < len && x[test] - x[current] <= differenceOffset) {
      ++test;
    }
    
    // If test is within bounds, add all pairs from test to len to the matrix
    if (test < len) {
      int num_new_pairs = len - test;
      for (int j = 0; j < num_new_pairs; ++j) {
        pairs(pair_count + j, 0) = x[test + j];
        pairs(pair_count + j, 1) = x[current];
      }
      pair_count += num_new_pairs;
    }
  }

  // Trim unused slots in the matrix
  NumericMatrix result = pairs(Range(0, pair_count - 1), _);
  return result;
}

Then, source the C++ function like so:

# Load Rcpp package
library(Rcpp)

# Source the C++ script
sourceCpp("find_pairs.cpp")

# Now you can use your C++ function in R
x <- 2010:2020
find_pairs_cpp(x)

Benchmarking

As a side note. I benchmarked this against the functions defined in jblood94's answer and see remarkable improvement, especially since the original find_pairs function didn't even survive in the larger test!

# "small" vector
x <- runif(1e2, 0, 10)

bm_small <- microbenchmark::microbenchmark(
  expand.grid = nrow(f0(x)),
  find_pairs.2 = nrow(find_pairs.2(x)),
  find_pairs_cpp = nrow(find_pairs_cpp(x)),
  f1 = nrow(f1(x)),
  f2 = nrow(f2(x)),
  f3 = nrow(f3(x)),
  f4 = nrow(f4(x)),
  check = "equal",
  times = 1000
)
> print(bm_small)
Unit: microseconds
           expr    min      lq      mean  median      uq     max neval
    expand.grid  776.5  827.80  878.1540  853.60  895.10  1559.9  1000
   find_pairs.2  394.3  432.10  462.8021  449.40  476.60  1042.6  1000
 find_pairs_cpp   28.1   39.60   48.8424   43.30   48.20   138.5  1000    
             f1  769.2  823.30  903.5507  847.10  885.60 21644.9  1000
             f2  657.8  700.10  832.8963  721.30  748.95 32129.7  1000
             f3 3387.0 3508.05 3768.5233 3565.00 3655.65 50499.4  1000
             f4 2511.5 2608.00 2770.2623 2650.95 2720.60 26553.3  1000
x <- runif(1e4, 0, 10)

bm_large <- microbenchmark::microbenchmark(
  expand.grid = nrow(f0(x)),
  find_pairs.2 = nrow(find_pairs.2(x)), # runs for several minutes without completing
  f1 = nrow(f1(x)),
  f2 = nrow(f2(x)),
  f3 = nrow(f3(x)),
  f4 = nrow(f4(x)),
  check = "equal",
  times = 10
)
> print(bm_large)
Unit: milliseconds
         expr       min        lq      mean    median        uq      max neval
  expand.grid   5344.5752 5551.6006 5640.7089 5689.3884 5750.9637 5767.5464 10
find_pairs.2    2531.3152 2574.1561 2679.6633 2643.0467 2788.3917 2879.4604 10
 find_pairs_cpp  405.0732  458.4121  505.8300  494.3591  531.5002  640.3842 10
             f1 2933.1794 2973.3883 3061.8112 3063.7269 3151.2178 3226.2590 10
             f2  906.5848  927.4680  979.4814  943.6236  962.3983 1177.1214 10
             f3  654.4493  699.1284  798.1042  787.8151  905.0556  958.2739 10
             f4  946.0565 1006.1900 1108.1919 1122.6784 1145.9541 1301.9947 10

Cheers!

Upvotes: 4

ThomasIsCoding
ThomasIsCoding

Reputation: 102309

A simple base R option with lapply

do.call(
    rbind,
    Filter(
        length,
        lapply(
            x,
            \(i) if (any(lg <- (i - x > 1))) cbind(i, j = x[lg])
        )
    )
)

gives

         i    j
 [1,] 2012 2010
 [2,] 2013 2010
 [3,] 2013 2011
 [4,] 2014 2010
 [5,] 2014 2011
 [6,] 2014 2012
 [7,] 2015 2010
 [8,] 2015 2011
 [9,] 2015 2012
[10,] 2015 2013
[11,] 2016 2010
[12,] 2016 2011
[13,] 2016 2012
[14,] 2016 2013
[15,] 2016 2014
[16,] 2017 2010
[17,] 2017 2011
[18,] 2017 2012
[19,] 2017 2013
[20,] 2017 2014
[21,] 2017 2015
[22,] 2018 2010
[23,] 2018 2011
[24,] 2018 2012
[25,] 2018 2013
[26,] 2018 2014
[27,] 2018 2015
[28,] 2018 2016
[29,] 2019 2010
[30,] 2019 2011
[31,] 2019 2012
[32,] 2019 2013
[33,] 2019 2014
[34,] 2019 2015
[35,] 2019 2016
[36,] 2019 2017
[37,] 2020 2010
[38,] 2020 2011
[39,] 2020 2012
[40,] 2020 2013
[41,] 2020 2014
[42,] 2020 2015
[43,] 2020 2016
[44,] 2020 2017
[45,] 2020 2018

Another base R option could be using combn

do.call(
    rbind,
    Filter(
        length,
        combn(x[order(-x)],
            2,
            \(...) if (diff(...) < -1) data.frame(t(...)),
            simplify = FALSE
        )
    )
)

which gives

     X1   X2
1  2020 2018
2  2020 2017
3  2020 2016
4  2020 2015
5  2020 2014
6  2020 2013
7  2020 2012
8  2020 2011
9  2020 2010
10 2019 2017
11 2019 2016
12 2019 2015
13 2019 2014
14 2019 2013
15 2019 2012
16 2019 2011
17 2019 2010
18 2018 2016
19 2018 2015
20 2018 2014
21 2018 2013
22 2018 2012
23 2018 2011
24 2018 2010
25 2017 2015
26 2017 2014
27 2017 2013
28 2017 2012
29 2017 2011
30 2017 2010
31 2016 2014
32 2016 2013
33 2016 2012
34 2016 2011
35 2016 2010
36 2015 2013
37 2015 2012
38 2015 2011
39 2015 2010
40 2014 2012
41 2014 2011
42 2014 2010
43 2013 2011
44 2013 2010
45 2012 2010

Upvotes: 2

jblood94
jblood94

Reputation: 17011

A dist solution, an efficient base solution using loops, and an efficient data.table solution all benchmarked against some of the solutions proposed so far.

A dist solution:

library(parallelDist)

f1 <- function(x, mindiff = 1) {
  n <- 2*length(x)
  d <- parDist(as.matrix(x))
  y <- which(d > mindiff)
  i <- (n + 1 - sqrt((n - 1)^2 - 8*(y - 1)))%/%2
  data.frame(Var1 = x[i], Var2 = x[i + y - (n - i)*(i - 1)/2], diff = d[y])
}

A memory-efficient base solution with breaking for loops:

f2 <- function(x, mindiff = 1) {
  x <- sort(x)
  n <- length(x)
  n1 <- n + 1L
  idx <- rep(n1, n - 1L)
  j <- 2L
  
  for (i in 1:(n - 1L)) {
    for (j in j:n) {
      if (x[j] - x[i] > mindiff) {
        idx[i] <- j
        break
      }
    }
    
    if (idx[i] == n1) break
  }
  
  span <- n - idx + 1L
  within(
    out <- data.frame(
      Var1 = x[rep.int(1:(n - 1L), span)], Var2 = x[sequence(span, idx)]
    ),
    diff <- Var2 - Var1
  )
}

A vectorized version of the previous solution:

library(data.table)

f3 <- function(x, mindiff = 1) {
  x <- sort(x)
  n <- length(x)

  setorder(
    data.table(
      x = c(x, x + mindiff), idx1 = rep(1:n, 2), keep = rep(0:1, each = n)
    ), x
  )[,idx2 := cummax(idx1)][
    keep == 1L, {
      span <- n - idx2
      .(
        Var1 = x[rep.int(idx1, span)],
        Var2 = x[sequence(span, idx2 + 1L)]
      )
    }
  ][,diff := Var2 - Var1]
}

Define functions for other solutions proposed so far:

f0 <- function(x, mindiff = 1) {
  # modified from OP
  grid = expand.grid(x, x)
  out <- grid[grid$Var1 - grid$Var2 > mindiff,]
  out$diff <- out$Var1 - out$Var2
  out
}

f4 <- function(x, mindiff = 1) {
  # from @thelatemail
  xd <- data.table(x)
  xd[, xp1 := x + mindiff]
  xd[xd, on=.(x>xp1), nomatch=0L, .(Var1=x.x, Var2=i.x, diff=x.x-i.x)]
}

find_pairs <- function(x, differenceOffset = 1) {
  # from @Khlick
  # sort the sequence
  x_sorted <- sort(x)
  len <- length(x_sorted)
  
  # initialize a list to store the pairs
  pairs <- list()
  
  # iterate through the sorted sequence
  for(current in 1:(len - 1)) {
    # iterate and test the next index forward of the current
    for(test in (current + 1):len) {
      # if the difference is > differenceOffset, add the pair to the list
      if(x_sorted[test] - x_sorted[current] > differenceOffset) {
        # flip the order to put pair[1] > pair[2] in the output
        pairs <- c(pairs, list(c(x_sorted[test], x_sorted[current])))
      }
    }
  }
  
  # convert the list to a matrix
  do.call(rbind, pairs)
}

Benchmark with a smallish dataset:

x <- runif(1e2, 0, 10)

microbenchmark::microbenchmark(
  expand.grid = nrow(f0(x)),
  loop1 = nrow(find_pairs(x)),
  dist = nrow(f1(x)),
  loop2 = nrow(f2(x)),
  data.table1 = nrow(f3(x)),
  data.table2 = nrow(f4(x)),
  check = "equal"
)
#> Unit: microseconds
#>         expr     min       lq      mean   median       uq      max neval
#>  expand.grid   606.9   673.15   746.301   705.05   754.60   3534.9   100
#>        loop1 66948.0 68710.55 71555.309 70563.85 73099.55 113492.4   100
#>         dist   534.5   578.20   634.820   620.00   671.80    898.3   100
#>        loop2   319.0   393.90   466.185   435.85   514.00    920.0   100
#>  data.table1  1870.9  1980.05  2158.900  2079.15  2302.05   3229.2   100
#>  data.table2  1804.9  2032.10  2746.983  2719.85  3284.35   5810.3   100

Benchmark with a much larger dataset:

x <- runif(1e4, 0, 10)

microbenchmark::microbenchmark(
  expand.grid = nrow(f0(x)),
  # loop1 = nrow(find_pairs(x)), # runs for several minutes without completing
  dist = nrow(f1(x)),
  loop2 = nrow(f2(x)),
  data.table1 = nrow(f3(x)),
  data.table2 = nrow(f4(x)),
  check = "equal",
  times = 10
)
#> Unit: milliseconds
#>         expr       min        lq      mean    median        uq       max neval
#>  expand.grid 4333.9824 4526.9577 4578.6956 4580.5573 4625.4831 4792.6571    10
#>         dist 1996.4856 2121.0837 2134.5839 2161.4953 2168.5389 2215.8524    10
#>        loop2  667.5903  748.8159  838.5044  788.3667  930.6512 1043.5914    10
#>  data.table1  432.8631  570.3818  635.3953  633.7162  706.0260  791.4238    10
#>  data.table2 1592.9779 1619.0466 1736.3997 1764.8470 1826.4533 1872.4085    10

Upvotes: 3

lroha
lroha

Reputation: 34601

For a contiguous set of numbers as in your example, you can use rep() and sequence() to build the vectors. This should be about as efficient as it gets using base R.

f <- function(start, end, difference) {
  ds <- ((end - start) - difference + 1):1
  data.frame(
    v1 = sequence(ds, (start + difference):end),
    v2 = rep(start:(end - difference), ds)
    )
}

f(2010, 2020, 2)

     v1   v2
1  2012 2010
2  2013 2010
3  2014 2010
4  2015 2010
5  2016 2010
6  2017 2010
7  2018 2010
8  2019 2010
9  2020 2010
10 2013 2011
...
43 2019 2017
44 2020 2017
45 2020 2018

Upvotes: 3

thelatemail
thelatemail

Reputation: 93908

A data.table option using a non-equi join to do the comparison.
I've simplified your selection to make it a fairer comparison, as the ifelse sub-step is not required

library(data.table)
x = 2010:4500

system.time({
    xd <- data.table(x)
    xd[, xp1 := x + 1]
    out1 <- xd[xd, on=.(x>xp1), nomatch=0L, .(Var1=x.x, Var2=i.x, diff=x.x-i.x)]
})
##   user  system elapsed 
##   0.04    0.01    0.07 

system.time({
    grid = expand.grid(x,x)
    out2 <- grid[grid$Var1 - grid$Var2 > 1,]
    out2$diff <- out2$Var1 - out2$Var2
})
##   user  system elapsed 
##   0.23    0.08    0.32 

Results match up in every row:

all(mapply(\(x,y) all(x==y), out1, out2))
## [1] TRUE

Upvotes: 3

Mark
Mark

Reputation: 12548

A first attempt:

nums <- 2010:2020

combn(nums, 2)[,combn(nums, 2, diff) > 1]

Update: creating the pairs from scratch:

library(purrr)

f <- function(d, start, end){
    map(start:(end-d), \(x) c(x, x + d))
}

pairs <- function(min_diff, start, end){
  map(min_diff:(end-start), \(d) f(d, start, end)) |> 
    flatten()
}

pairs(2, 2010, 2020)

Upvotes: 3

Related Questions