Reputation: 107
I have a vector like this:
10 7 7 10 7 10 7 10 10 7 10 10 7 7 10 10 7 10 7 7 10 7 10
I want to compare the entries of the vector in pairs: the first entry with the second, the third with the fourth, and so on, until a pair contains two equal entries. In this example, two equal values first occur in the sixth pair; in other words, the 11th and 12th values are equal. The important part is that I then want the index of the 11th row, and I want to continue the comparison with the 12th and 13th rows.
Is there a good way to do this (I would prefer to do it without looping)?
EDIT: I really didn't explain myself clearly enough. When a pair has equal values, I would like to delete the first entry of the two. So the indices of the pairs are not known from the start. In the above example, the desired output would be:
10 7 7 10 7 10 7 10 10 7 10 7 7 10 10 7 10 7 7 10 7 10
and the index of the row which has been deleted:
11
In this case only one row had to be deleted so that all pairs consist of a 7 and a 10.
Upvotes: 2
Views: 2143
Reputation: 44309
Based on the edited version of the question, it's now clear that you need some sort of a looping function, because your decisions on previous indices affect your decisions on subsequent indices. The most efficient way I can think to do this would be to populate a logical vector indicating whether each index should be kept in the vector. Afterward you can use the logical vector to get both the remaining values and the indices that were removed.
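x <- c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10, 7, 7, 10, 10, 7, 10, 7, 7, 10, 7, 10)
keep <- rep(TRUE, length(x))
# even is TRUE when pos is the second entry of the current pair
even <- TRUE
for (pos in 2:length(x)) {
  if (even & x[pos] == x[pos-1]) {
    # Equal pair: drop its first entry; pos then pairs up with pos+1
    keep[pos-1] <- FALSE
  } else {
    even <- !even
  }
}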
x[keep]
# [1] 10 7 7 10 7 10 7 10 10 7 10 7 7 10 10 7 10 7 7 10 7 10
which(!keep)
# [1] 11
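One way to sanity-check the result, assuming the goal is that every consecutive pair in the cleaned vector holds two different values. The helper name pairs_differ is made up for this sketch and is not part of the answer above:

```r
# Hypothetical helper: TRUE if every complete pair (1st/2nd, 3rd/4th, ...)
# in v holds two different values. A trailing unpaired entry is ignored.
pairs_differ <- function(v) {
  n <- length(v) - length(v) %% 2   # number of entries in complete pairs
  if (n == 0) return(TRUE)
  first  <- v[seq(1, n, by = 2)]
  second <- v[seq(2, n, by = 2)]
  all(first != second)
}

x <- c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10, 7, 7, 10, 10, 7, 10, 7, 7, 10, 7, 10)
pairs_differ(x)        # FALSE: the 6th pair (entries 11 and 12) is 10, 10
pairs_differ(x[-11])   # TRUE: after dropping entry 11 every pair is a 7 and a 10
```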
As with any looping function, Rcpp can be used to get a speedup:
library(Rcpp)
cppFunction(
"LogicalVector getBin(NumericVector x) {
  const int n = x.size();
  LogicalVector keep(n, true);
  bool even = true;
  for (int pos=1; pos < n; ++pos) {
    if (even && x[pos] == x[pos-1]) {
      keep[pos-1] = false;
    } else {
      even = !even;
    }
  }
  return keep;
}")
Benchmarking of the pure-R and Rcpp approaches:
# Slightly larger dataset
set.seed(144)
x <- sample(1:10, 1000, replace=T)
# Functions to compare
pureR <- function(x) {
  keep <- rep(TRUE, length(x))
  even <- TRUE
  for (pos in 2:length(x)) {
    if (even & x[pos] == x[pos-1]) {
      keep[pos-1] <- FALSE
    } else {
      even <- !even
    }
  }
  list(x[keep], which(!keep))
}
with.Rcpp <- function(x) {
  keep <- getBin(x)
  list(x[keep], which(!keep))
}
all.equal(pureR(x), with.Rcpp(x))
# [1] TRUE
library(microbenchmark)
microbenchmark(pureR(x), with.Rcpp(x))
# Unit: microseconds
# expr min lq mean median uq max neval
# pureR(x) 855.318 1066.177 1806.67855 1140.656 1442.869 35379.369 100
# with.Rcpp(x) 30.137 62.304 86.80656 78.132 94.771 348.598 100
With a vector of length 1000, Rcpp gives a speedup of more than 10x (roughly 15x at the median). In absolute terms both versions finish in about a millisecond or less here, so the speedup only becomes relevant for much larger vectors.
Upvotes: 2
Reputation: 44309
You could extract the odd- and even-numbered entries in the vector and compare:
x=c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10, 7, 7, 10, 10, 7, 10, 7, 7, 10, 7, 10,12)
x[seq(1, length(x), 2)] == x[seq(2, length(x), 2)]
# [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
This will be a good deal faster than grouping by pairs and making each comparison individually:
# Slightly larger dataset
set.seed(144)
x <- sample(1:10, 1000, replace=T)
# Vectorized odd/even comparison
josilber <- function(x) x[seq(1, length(x), 2)] == x[seq(2, length(x), 2)]
# Grouping solution from @user7598's post
user7598 <- function(x) tapply(x, (seq_along(x)-1) %/%2 +1, function(y) y[1]==y[2])
all.equal(josilber(x), unname(as.vector(user7598(x))))
# [1] TRUE
# Compare speed on 1000-length vector
library(microbenchmark)
microbenchmark(josilber(x), user7598(x))
# Unit: microseconds
# expr min lq mean median uq max neval
# josilber(x) 74.350 109.319 223.102 164.961 242.236 2411.465 100
# user7598(x) 2271.347 2440.235 5040.763 3119.307 5356.552 110777.522 100
We see a roughly 20x speedup on a vector of length 1000. This is because comparing the odd indices to the even ones takes advantage of vectorization: it makes a single call to == with all the data that needs to be compared. If you instead group and then compare within each small group, you make many calls to == on smaller vectors.
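To get from the logical vector back to positions in the original vector, one option is which(): pair k covers entries 2k-1 and 2k. A small sketch, with variable names chosen here for illustration. Note that it only covers the fixed pairing from the original question, not the re-pairing after deletions described in the edit:

```r
x <- c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10, 7, 7, 10, 10, 7, 10, 7, 7, 10, 7, 10, 12)
eq <- x[seq(1, length(x), 2)] == x[seq(2, length(x), 2)]

# Number of the first pair whose two entries are equal
first_pair <- which(eq)[1]
first_pair
# [1] 6

# Positions of that pair's entries in x
first_pair_pos <- c(2 * first_pair - 1, 2 * first_pair)
first_pair_pos
# [1] 11 12
```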
Upvotes: 4
Reputation: 886968
You may also try
v1 <- c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10, 7, 7, 10, 10, 7,
        10, 7, 7, 10, 7, 10)
f1 <- function(v){
  # Drop the last entry if the length is odd, so only complete pairs remain
  if(length(v)%%2!=0)
    v <- v[-length(v)]
  # Each column of the 2-row matrix is one pair
  m1 <- matrix(v, nrow=2)
  m1[1,] == m1[2,]
}
f1(v1)
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
set.seed(144)
x <- sample(1:10, 1000, replace=T)
library(microbenchmark)
# josilber() is the odd/even function defined in the other answer
microbenchmark(josilber(x), akrun=f1(x), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
#josilber(x) 4.791352 4.768276 4.675041 4.64354 4.474515 5.340249 20 b
# akrun 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 20 a
identical(josilber(x), f1(x))
#[1] TRUE
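The matrix approach relies on matrix() filling column by column, so with nrow=2 each column holds one pair. A tiny illustration on a made-up six-element vector:

```r
v <- c(10, 7, 7, 10, 10, 10)
m <- matrix(v, nrow = 2)   # filled column-wise: column k is pair k
m
#      [,1] [,2] [,3]
# [1,]   10    7   10
# [2,]    7   10   10

# Row 1 holds the first entry of each pair, row 2 the second
m[1, ] == m[2, ]
# [1] FALSE FALSE  TRUE
```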
Upvotes: 4
Reputation: 1678
If you create an index for the pairs, you can use tapply. For example:
x=c(10, 7, 7, 10, 7, 10, 7, 10, 10, 7, 10, 10, 7, 7, 10, 10, 7, 10, 7, 7, 10, 7, 10,12) #note the addition of "12" so that every entry has a partner.
pair=(seq_along(x)-1) %/%2 +1 #create an index for the pairs. Thanks to @akrun for this bit of code
tapply(x,pair,function(x) x[1]==x[2])
# 1 2 3 4 5 6 7 8 9 10 11 12
#FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
The result is a TRUE or FALSE value for each pair, indicating whether its two values match.
Note that this needs an even number of entries in the vector (i.e., no incomplete pair), so I added a number to your example.
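If padding the vector is not wanted, one possible variation is to let the comparison function return NA for an incomplete pair. This is only a sketch on a made-up odd-length vector, not part of the answer above:

```r
x <- c(10, 7, 7, 10, 10, 10, 7)        # odd length: the last entry has no partner
pair <- (seq_along(x) - 1) %/% 2 + 1   # 1 1 2 2 3 3 4

# Compare within each group; an incomplete pair yields NA instead of TRUE/FALSE
res <- tapply(x, pair, function(p) if (length(p) == 2) p[1] == p[2] else NA)
unname(as.vector(res))
# [1] FALSE FALSE  TRUE    NA
```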
Upvotes: 1