user2778822

Reputation: 71

Compare each cell for equality in two data frames of equal size in R

I have two data frames, say A and B, of equal size (same number of rows and columns). I would like to output a data frame, say C, of the same size, with every value either 0 or 1.

C[i,j] = 0, if A[i,j] != B[i,j]
C[i,j] = 1, if A[i,j] == B[i,j]

I do not want to use loops or ifelse statements: I have already done it that way successfully, but it takes a very long time. If there is a more straightforward way to do the same thing, it would be really helpful. Thanks.

Upvotes: 2

Views: 3904

Answers (4)

R Yoda

Reputation: 8750

Simply compare the two data.frames to get a matrix of the same size, with a logical value in each cell indicating the comparison result:

A <- mtcars
B <- mtcars

A == B

Result (first rows shown only):

                     mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
Mazda RX4           TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Mazda RX4 Wag       TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Datsun 710          TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Hornet 4 Drive      TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

To get a data.frame from the comparison use:

C <- as.data.frame(A == B)

Since R treats TRUE as 1 and FALSE as 0, you can coerce the result to integers explicitly to get the 0/1 values the OP asked for:

as.data.frame(lapply(as.data.frame(A == B), as.integer))

Multiplying by 1 (as proposed in another answer) is prettier and probably more efficient:

as.data.frame(1 * (A == B))
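For illustration, here is a minimal sketch on two tiny data frames that differ in exactly one cell. The names `A_small`, `B_small` and `C_small` are made up for this example and are not from the question.

```r
# Hypothetical toy data: identical except for the cell in row 2, column "y"
A_small <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
B_small <- data.frame(x = 1:3, y = c("a", "z", "c"), stringsAsFactors = FALSE)

# Cell-wise comparison gives a logical matrix; multiplying by 1 turns it into 1/0
C_small <- as.data.frame(1 * (A_small == B_small))
C_small
#   x y
# 1 1 1
# 2 1 0
# 3 1 1
```

The single 0 marks the one cell where the two inputs disagree.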

Edit++ [Benchmark added; Benchmark improved for consistency]:

A benchmark of the different answers, based on data.frames with 10 million rows (about 260 MB)...

library(microbenchmark)   # install.packages("microbenchmark")
library(data.table)

A <- data.frame(col1 = 1:1E7,
                col2 = rep(c("a string", "another string"), 1E7/2),
                col3 = 1:1E7,
                col4 = 1:1E7,
                col5 = rep(LETTERS[1:10],1E6),
                stringsAsFactors = FALSE)
B <- A
B[1,1]=100  # change one cell to create a copy of the data.frame

microbenchmark(DF.equals       = as.data.frame(A == B),
               DF.mult         = as.data.frame(1 * (A == B)),
               DF.map          = as.data.frame(Map(`==`, A, B)),
               matrix.equals   = A == B,
               matrix.mult     = 1 * (A == B),
               matrix.map      = do.call(cbind, Map(`==`, A, B)),  # causes a warning: duplicated levels in factors are deprecated
               list.map        = Map(`==`, A, B),                  # fast because it does not construct a matrix, only vectors
               times = 100)

shows the Map() function as the clear winner (on my system), being two to four times faster than the other variants, and that returning the result as a matrix is much faster than converting it to a data.frame:

Unit: milliseconds
          expr      min       lq     mean   median       uq      max neval    cld
     DF.equals 627.2541 630.7565 654.0266 635.1831 678.8903 686.0753   100     e 
      DF.mult  743.8531 751.7933 781.1876 796.2282 799.1881 848.2455   100      f
        DF.map 169.6967 170.5842 176.5944 171.5072 173.5665 223.3354   100 a     
 matrix.equals 294.2570 297.5330 311.8095 299.8093 345.0827 351.9193   100   c   
  matrix.mult  402.6166 406.5279 422.9322 408.3012 453.4484 602.2139   100    d  
    matrix.map 206.2596 208.4230 217.8891 209.8968 211.4139 266.1867   100  b    
      list.map 169.1922 170.5403 175.7539 171.4602 173.3891 224.7062   100 a   
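Before trusting timings like these, it can help to confirm on a small input that the variants return the same values. A minimal sketch, assuming a made-up four-row data frame (`A_chk`, `B_chk`) rather than the 10-million-row one above:

```r
# Hypothetical small inputs, built the same way as in the benchmark
A_chk <- data.frame(col1 = 1:4,
                    col2 = c("a", "b", "a", "b"),
                    stringsAsFactors = FALSE)
B_chk <- A_chk
B_chk[1, 1] <- 100                               # introduce a single mismatch

m_eq  <- A_chk == B_chk                          # matrix from the direct comparison
m_map <- do.call(cbind, Map(`==`, A_chk, B_chk)) # matrix assembled from Map() output

all(m_eq == m_map)                               # TRUE if both variants agree cell by cell
```

The two matrices differ only in their row names, so the check compares values rather than using identical().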

BTW:

What I really like is that you can now compute some statistics, e.g. count the number of mismatches per column (or per row if you use rowSums instead):

colSums(C != TRUE)

or

colSums(A != B)

to get a result usable for automatic checking of preconditions (e.g. no mismatches allowed):

 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
   0    0    0    0    0    0    0    0    0    0    0 
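One way such a precondition check might look is sketched below. It assumes the same mtcars copies as above; the variable name `mismatches` is only illustrative.

```r
A <- mtcars
B <- mtcars

# Count mismatches per column; all zeros means the data frames agree cell by cell
mismatches <- colSums(A != B)

# Raises an error if any cell differs (passes silently here)
stopifnot(all(mismatches == 0))

# After changing one cell the check above would fail;
# which() with arr.ind = TRUE reports the row and column of each differing cell
B[1, "mpg"] <- 0
which(A != B, arr.ind = TRUE)
```

Here the last call reports row 1, column 1, the cell that was changed.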

Upvotes: 9

SamPassmore

Reputation: 1365

This example produces a TRUE/FALSE matrix, which can effectively be treated as 1/0 in R:

x = matrix(1:9, nrow = 3)
y = matrix(9:1, nrow = 3)
x == y
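To show what "treated as 1/0" means in practice, here is a small sketch using the same x and y. The name `eq` is introduced only for this illustration.

```r
x <- matrix(1:9, nrow = 3)
y <- matrix(9:1, nrow = 3)

eq <- x == y   # logical matrix; only the centre cell (value 5 in both) matches

sum(eq)        # TRUE counts as 1, so this is the number of matching cells: 1
eq + 0L        # same shape, but stored as integer 1/0 instead of TRUE/FALSE
```

Arithmetic on the logical matrix coerces it on the fly, so an explicit conversion is only needed when the stored type matters.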

Since there were a couple of other suggestions in the answers I thought I'd test which was the quickest since that was a requirement of the question.

Here equals refers to the A == B solution and map_xy is the Map solution.

microbenchmark(equals(x,y), map_xy(x,y), times = 1000)
Unit: nanoseconds
         expr   min    lq      mean  median    uq   max neval
 equals(x, y)   360   399   468.491   459.0   508  3473  1000
 map_xy(x, y) 10909 12114 13506.830 13132.5 14158 77743  1000

It looks like equals is a much faster option - but as the Map answer indicates it might perform better with larger datasets. So I tested again with reasonably sized data:

x_big = matrix(1:900000, nrow = 3)
y_big = matrix(900000:1, nrow = 3)
microbenchmark(equals(x_big,y_big), map_xy(x_big,y_big), times = 100)
Unit: milliseconds
                 expr        min          lq        mean      median          uq        max neval
 equals(x_big, y_big)   1.579069    2.118332    2.515257    2.225747    2.375377   21.48414   100
 map_xy(x_big, y_big) 846.172497 1040.383027 1165.354138 1147.239396 1321.166762 1489.81884   100

Which suggests that equals is still the faster option.

EDIT

In response to comments, here is the code for each function. I have edited these slightly to convert the output to a data.frame (although I personally think this step is unnecessary).

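library(magrittr)  # provides the %>% pipe used in map_xy

# Cell-wise comparison, returned as a data.frame
equals = function(x, y){
  as.data.frame(x == y)
}

# Element-wise comparison via Map, reshaped back to the input's row count
map_xy = function(x, y){
  Map('==', x, y) %>%
    unlist(.) %>%
    matrix(., nrow = nrow(x)) %>%
    as.data.frame(.)
}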

This changes the benchmark results but not the outcome:

For small matrices:

Unit: microseconds
         expr     min       lq      mean   median       uq      max neval
 equals(x, y)  18.090  20.3205  24.31075  22.0205  23.7285  781.048  1000
 map_xy(x, y) 172.699 186.0775 209.39585 193.3645 204.0220 2646.419  1000

For large matrices:

Unit: milliseconds
                 expr       min        lq     mean    median        uq      max neval
 equals(x_big, y_big)  533.3274  646.0605  744.063  705.4923  871.3479 1067.411   100
 map_xy(x_big, y_big) 1637.2882 1820.8714 1938.458 1921.2563 2041.0533 2564.669   100

If you want the functions I originally used - just take out the code for converting into a data.frame.

Upvotes: 1

Giora Simchoni

Reputation: 3689

Try:

C <- data.frame(1 * (A == B))

The 1 * turns TRUE/FALSE into 1/0, as required.

Upvotes: 5

akrun

Reputation: 887078

We can use Map to compare the corresponding columns of the two data.frames 'A' and 'B':

Map(`==`, A, B)

The advantage is that we get a list of logical vectors in the workspace instead of a matrix. If the datasets are really big, holding the full matrix output could exhaust the available memory.
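If a 0/1 data.frame is still wanted at the end, one possible way to get there from the Map output is sketched below. The toy inputs and the name `cols` are assumptions for this example; the `C[] <-` assignment is a common base R idiom for replacing all columns while keeping the data.frame structure.

```r
# Hypothetical toy data standing in for the real A and B
A <- data.frame(id = 1:3, grp = c("a", "b", "c"), stringsAsFactors = FALSE)
B <- data.frame(id = c(1L, 5L, 3L), grp = c("a", "b", "x"), stringsAsFactors = FALSE)

# Compare column by column and coerce each logical vector to integer 1/0
cols <- Map(function(a, b) as.integer(a == b), A, B)

# Fill a copy of A with the results; `C[] <-` keeps the row and column names
C <- A
C[] <- cols
C
#   id grp
# 1  1   1
# 2  0   1
# 3  1   0
```

This never builds an intermediate matrix, so each column is compared and stored on its own.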

Upvotes: 4
