merlin2011

Reputation: 75545

Vectorize comparison of a row vector with every row of a dataframe in R?

Suppose I have a data frame that comes from reading in the following file Foo.csv

A,B,C
1,2,3
2,2,4
1,7,3

I would like to count the number of matching elements between the first row and subsequent rows. For example, the first row matches with the second row in one position, and matches with the third row in two positions. Here is some code that will achieve the desired effect.

foo = read.csv("Foo.csv")

# one counter per row; numDiffs[1] stays 0 because row 1 is the reference
# (note: despite the name, this counts matches, not differences)
numDiffs = rep(0, dim(foo)[1])
for (i in 2:dim(foo)[1]) {
   # both sides are 1-row data frames of equal size, so == is defined here
   numDiffs[i] = sum(foo[i,] == foo[1,])
}
print(numDiffs)
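
For the sample data above, this prints:

[1] 0 1 2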

My question is: can this be vectorized to eliminate the loop and possibly reduce the running time? My first attempt is below, but it throws an error, because == is not defined between data frames of different sizes.

colSums(foo == foo[1,])
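
The error (wording may vary slightly across R versions) looks like:

Error in Ops.data.frame(foo, foo[1, ]) :
  '==' only defined for equally-sized data frames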

Upvotes: 4

Views: 370

Answers (4)

Simon O'Hanlon

Reputation: 59970

eh, I don't see why you can't just do..

c(foo[1,]) == foo
#         A     B     C
#[1,]  TRUE  TRUE  TRUE
#[2,] FALSE  TRUE FALSE
#[3,]  TRUE FALSE  TRUE

.. or even better foo[1,,drop=TRUE] == foo...

Thus the result becomes...

rowSums( c( foo[1,] ) == foo )
#[1] 3 1 2

Remember, foo[1,] is still a data.frame. Coerce it to a vector and == is defined for what you need. This seems to be a little quicker than the vapply answer suggested by @AnandaMahto on a big data.frame.
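
To see what the coercion does (a quick check on the sample foo), note that c() drops the data.frame class, leaving a plain list that == recycles against the columns:

class( foo[1,] )
#[1] "data.frame"
class( c( foo[1,] ) )
#[1] "list"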

Benchmarking

Comparing this against fun3 and fun4 from @AnandaMahto's answer, I see a small speed improvement when using the larger data.frame, mydf...

microbenchmark(fun3(mydf), fun4(mydf), fun6(mydf) , times = 20)
#Unit: milliseconds
#       expr      min       lq   median       uq      max neval
# fun3(mydf) 320.7485 344.9249 356.1657 365.7576 399.5334    20
# fun4(mydf) 299.6660 313.7105 319.1700 327.8196 555.4625    20
# fun6(mydf) 196.8244 241.4866 252.6311 258.8501 262.7968    20

fun6 is defined as...

fun6 <- function(data) rowSums( c( data[1,] ) == data )

Upvotes: 4

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

As your dataset grows larger, you might get a bit more speed with something like this:

as.vector(c(0, rowSums(foo[rep(1, nrow(foo) - 1), ] == foo[-1, ])))
# [1] 0 1 2

The basic idea is to create a data.frame of the first row with the same dimensions as the overall dataset less one row, and use that to check for equivalence with the remaining rows.
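
On the sample data, that row-replication step produces:

foo[rep(1, nrow(foo) - 1), ]
#     A B C
# 1   1 2 3
# 1.1 1 2 3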


Deleting my original update, here are some benchmarks instead. Change "N" to see the effect on different data.frame sizes. The solution from @nacnudus scales best.

set.seed(1)
N <- 10000000
mydf <- data.frame(matrix(sample(10, N, replace = TRUE), ncol = 10))
dim(mydf)
# [1] 1000000      10

# fun1: the sapply-over-columns approach (see the last answer below)
fun1 <- function(data) rowSums(sapply(data, function(x) c(0,x[1] == x[2:nrow(data)])))
# fun2: replicate row 1 so == compares equally-sized data.frames
fun2 <- function(data) as.vector(c(0, rowSums(data[rep(1, nrow(data) - 1), ] == data[-1, ])))
# fun3: @nacnudus's matrix comparison with recycling
fun3 <- function(data) {
  bar <- as.matrix(data)
  c(0, rowSums(t(t(bar[-1, ]) == bar[1, ])))
}

library(microbenchmark)

## On your original sample data
microbenchmark(fun1(foo), fun2(foo), fun3(foo))
# Unit: microseconds
#       expr     min       lq   median       uq     max neval
#  fun1(foo) 109.903 119.0975 122.5185 127.0085 228.785   100
#  fun2(foo) 333.984 354.5110 367.1260 375.0370 486.650   100
#  fun3(foo) 233.490 250.8090 264.7070 269.8390 518.295   100

## On the sample data created above--I don't want to run this 100 times!
system.time(fun1(mydf))
#    user  system elapsed 
#   15.53    0.06   15.60
system.time(fun2(mydf))
#    user  system elapsed 
#    2.05    0.01    2.06 
system.time(fun3(mydf))
#    user  system elapsed 
#    0.32    0.00    0.33 

HOWEVER, if Codoremifa were to change their code to vapply instead of sapply, that answer wins! From 15 seconds down to 0.24 seconds on 1 million rows.

fun4 <- function(data) {
  # vapply is faster than sapply here because FUN.VALUE declares the
  # type and length of each result up front, skipping sapply's
  # simplification step
  rowSums(vapply(data, function(x) c(0, x[1] == x[2:nrow(data)]), 
                 vector("numeric", length=nrow(data))))
}

microbenchmark(fun3(mydf), fun4(mydf), times = 20)
# Unit: milliseconds
#        expr      min       lq   median       uq      max neval
#  fun3(mydf) 369.5957 422.9507 438.8742 462.6958 486.3757    20
#  fun4(mydf) 238.1093 316.9685 323.0659 328.0969 341.5154    20

Upvotes: 4

nacnudus

Reputation: 6528

Or using the automatic recycling of matrix comparisons:

bar <- as.matrix(foo)
c(0, rowSums(t(t(bar[-1, ]) == bar[1, ])))
# [1] 0 1 2

t() is there twice because the recycling is column- rather than row-wise.
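
A quick sketch of why the transposes are needed, using the sample data:

# without t(), bar[1, ] is recycled down the columns, so values from
# different columns of the original rows get compared
bar[-1, ] == bar[1, ]
# after transposing, each column of t(bar[-1, ]) is one original row,
# and the length-3 vector bar[1, ] lines up with it correctly:
t(bar[-1, ]) == bar[1, ]
#    [,1]  [,2]
# A FALSE  TRUE
# B  TRUE FALSE
# C FALSE  TRUE
# the outer t() flips the result back so rowSums() counts per row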

Upvotes: 5

TheComeOnMan

Reputation: 12875

> rowSums(sapply(foo, function(x) c(0,x[1] == x[2:nrow(foo)])))
[1] 0 1 2

Upvotes: 4
