FRV

Reputation: 107

R: compare the next two values in a vector with each other (without looping if possible)

I have a vector like this:

10  7  7 10  7 10  7 10 10  7 10 10  7  7 10 10  7 10  7  7 10  7 10

I want to compare the entries of the vector in pairs, e.g. the first entry with the second, the third with the fourth, and so on, until a pair has two equal entries. In this example, two equal values first occur in the sixth pair; in other words, the 11th and 12th values are equal. What matters is that I then want the index of the 11th row and to continue with the comparison between the 12th and 13th row.

Is there a good way to do this (I would prefer to do it without looping)?

EDIT: I really didn't explain myself clearly enough. When a pair has equal values, I would like to delete the first entry of that pair. So the indices of the pairs are not known from the start. In the above example, the desired output would be:

10  7  7 10  7 10  7 10 10  7 10  7  7 10 10  7 10  7  7 10  7 10

and the index of the row which has been deleted:

11

In this case only one row had to be deleted so that all pairs consist of a 7 and a 10.

Upvotes: 2

Views: 2143

Answers (4)

josliber

Reputation: 44309

Based on the edited version of the question, it's now clear that you need some sort of a looping function, because your decisions on previous indices affect your decisions on subsequent indices. The most efficient way I can think to do this would be to populate a logical vector indicating whether each index should be kept in the vector. Afterward you can use the logical vector to get both the remaining values and the indices that were removed.

x <- c(10,  7,  7, 10,  7, 10,  7, 10, 10,  7, 10, 10,  7,  7, 10, 10, 7, 10,  7,  7, 10,  7, 10)
keep <- rep(TRUE, length(x))
even <- TRUE
for (pos in 2:length(x)) {
  if (even && x[pos] == x[pos-1]) {  # && is the scalar, short-circuiting form
    keep[pos-1] <- FALSE
  } else {
    even <- !even
  }
}
x[keep]
# [1] 10  7  7 10  7 10  7 10 10  7 10  7  7 10 10  7 10  7  7 10  7 10
which(!keep)
# [1] 11

As with any looping function, Rcpp can be used to get a speedup:

library(Rcpp)
cppFunction(
"LogicalVector getBin(NumericVector x) {
  const int n = x.size();
  LogicalVector keep(n, true);
  bool even = true;
  for (int pos=1; pos < n; ++pos) {
    if (even && x[pos] == x[pos-1]) {
      keep[pos-1] = false;
    } else {
      even = !even;
    }
  }
  return keep;
}")

Benchmarking of the pure-R and Rcpp approaches:

# Slightly larger dataset
set.seed(144)
x <- sample(1:10, 1000, replace=TRUE)

# Functions to compare
pureR <- function(x) {
  keep <- rep(TRUE, length(x))
  even <- TRUE
  for (pos in 2:length(x)) {
    if (even && x[pos] == x[pos-1]) {  # && is the scalar, short-circuiting form
      keep[pos-1] <- FALSE
    } else {
      even <- !even
    }
  }
  list(x[keep], which(!keep))
}
with.Rcpp <- function(x) {
  keep <- getBin(x)
  list(x[keep], which(!keep))
}
all.equal(pureR(x), with.Rcpp(x))
# [1] TRUE
library(microbenchmark)
microbenchmark(pureR(x), with.Rcpp(x))
# Unit: microseconds
#          expr     min       lq       mean   median       uq       max neval
#      pureR(x) 855.318 1066.177 1806.67855 1140.656 1442.869 35379.369   100
#  with.Rcpp(x)  30.137   62.304   86.80656   78.132   94.771   348.598   100

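With a vector of length 1000 we see a speedup of more than 10x from using Rcpp (median of about 1141 vs. 78 microseconds). In absolute terms both versions finish in roughly a millisecond here, so the speedup would only be relevant for much larger vectors.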

Upvotes: 2

josliber

Reputation: 44309

You could extract the odd- and even-numbered entries in the vector and compare:

x=c(10,  7,  7, 10,  7, 10,  7, 10, 10,  7, 10, 10,  7,  7, 10, 10, 7, 10,  7,  7, 10,  7, 10,12)
x[seq(1, length(x), 2)] == x[seq(2, length(x), 2)]
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE

This will be a good deal faster than grouping by pairs and making each comparison individually:

# Slightly larger dataset
set.seed(144)
x <- sample(1:10, 1000, replace=TRUE)

# Grouping solution from @user7598's post
josilber <- function(x) x[seq(1, length(x), 2)] == x[seq(2, length(x), 2)]
user7598 <- function(x) tapply(x, (seq_along(x)-1) %/%2 +1, function(y) y[1]==y[2])
all.equal(josilber(x), unname(as.vector(user7598(x))))
# [1] TRUE

# Compare speed on 1000-length vector
library(microbenchmark)
microbenchmark(josilber(x), user7598(x))
# Unit: microseconds
#         expr      min       lq     mean   median       uq        max neval
#  josilber(x)   74.350  109.319  223.102  164.961  242.236   2411.465   100
#  user7598(x) 2271.347 2440.235 5040.763 3119.307 5356.552 110777.522   100

We see a roughly 20x speedup on a vector of length 1000. This is because comparing the odd indices to the even ones takes advantage of vectorization: it makes a single call to == with all the data that needs to be compared. By contrast, if you group first and then compare within each small group, you make many calls to == on tiny vectors.
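
To tie the vectorized comparison back to the wording of the question, the logical result can be turned into the position of the first equal pair. This is only a minimal sketch: the names eq, first_pair, and first_index are illustrative and not part of the original answer, and it covers the pre-edit version of the question (fixed pairs, no deletion).

```r
x <- c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10,
       7, 7, 10, 10, 7, 10, 7, 7, 10, 7, 10, 12)

# One vectorized comparison of the odd positions against the even positions
eq <- x[seq(1, length(x), 2)] == x[seq(2, length(x), 2)]

# Number of the first pair whose two entries are equal (NA if there is none)
first_pair <- which(eq)[1]

# Position in x of the first entry of that pair
first_index <- 2 * first_pair - 1
```

For this example, first_pair is 6 and first_index is 11, matching the sixth pair and the 11th value described in the question.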

Upvotes: 4

akrun

Reputation: 886968

You may also try

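f1 <- function(v) {
  # Drop the last entry if the length is odd, so every entry has a partner
  if (length(v) %% 2 != 0)
    v <- v[-length(v)]
  # Each column of the 2-row matrix holds one pair
  m1 <- matrix(v, nrow=2)
  m1[1,] == m1[2,]
}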

 f1(v1)
 #[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Benchmarks

set.seed(144)
x <- sample(1:10, 1000, replace=TRUE)
library(microbenchmark)
microbenchmark(josilber(x), akrun=f1(x), unit='relative', times=20L)
#Unit: relative
#    expr      min       lq     mean  median       uq      max neval cld
#josilber(x) 4.791352 4.768276 4.675041 4.64354 4.474515 5.340249    20   b
#      akrun 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000    20  a 

identical(josilber(x), f1(x))
#[1] TRUE

data

v1 <- c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10, 7, 7, 10, 10, 7,
    10, 7, 7, 10, 7, 10)

Upvotes: 4

User7598

Reputation: 1678

If you create an index for the pairs, you can use tapply. For example:

x <- c(10,  7,  7, 10,  7, 10,  7, 10, 10,  7, 10, 10,  7,  7, 10, 10, 7, 10,  7,  7, 10,  7, 10, 12) # note the added 12, so that every entry belongs to a complete pair
pair <- (seq_along(x)-1) %/% 2 + 1 # create an index for the pairs. Thanks to @akrun for this bit of code
tapply(x, pair, function(y) y[1] == y[2])
#    1     2     3     4     5     6     7     8     9    10    11    12 
#FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE 

The result is a TRUE or FALSE value for each pair, indicating whether its two values match.

Note that the index won't work if the vector has an odd number of entries (i.e., an incomplete pair), so I added a number to your example.
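
If padding the data is not desirable, one option is to drop the unpaired trailing entry before grouping, much as f1 in akrun's answer does. The following is a sketch under that assumption (that the leftover entry can simply be ignored); the names n_paired, v_even, and res are illustrative only. Without the trimming step, the last group would hold a single value, so y[2] would be NA and the comparison for that group would yield NA.

```r
# The original 23-entry vector from the question (odd length)
v1 <- c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10,
        7, 7, 10, 10, 7, 10, 7, 7, 10, 7, 10)

# Keep only as many entries as form complete pairs
# (assumption: the leftover entry can be ignored)
n_paired <- length(v1) - length(v1) %% 2
v_even <- v1[seq_len(n_paired)]

# Pair index as above: 1 1 2 2 3 3 ...
pair <- (seq_along(v_even) - 1) %/% 2 + 1

# One TRUE/FALSE per complete pair
res <- tapply(v_even, pair, function(y) y[1] == y[2])
```

Here res has 11 entries, one per complete pair, and contains no NA.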

Upvotes: 1
