Reputation: 71
I have two data.frames, say A and B, of equal size (same number of rows and columns). I would like to produce a data.frame, say C, of the same size, with every value either 0 or 1:
C[i,j] = 0, if A[i,j] != B[i,j]
C[i,j] = 1, if A[i,j] == B[i,j]
I do not want to use loops or an ifelse statement; I have done that successfully, but it takes a very long time. If there is a more straightforward way to do this, it would be really helpful. Thanks.
Upvotes: 2
Views: 3904
Reputation: 8750
Simply compare the two data.frames to get a matrix of the same size, with a logical value in each cell indicating the comparison result:
A <- mtcars
B <- mtcars
A == B
Result (first rows shown only):
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Mazda RX4 Wag TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Datsun 710 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Hornet 4 Drive TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
To get a data.frame from the comparison, use:
C <- as.data.frame(A == B)
You can use the fact that TRUE == 1 and FALSE == 0 in R to coerce the result into integers explicitly (as the OP asked for):
as.data.frame(lapply(as.data.frame(A == B), as.integer))
Multiplying by 1 (as proposed in another answer) is prettier and probably more efficient (it avoids the lapply() call over all columns):
as.data.frame(1 * (A == B))
Edit++ [Benchmark added; benchmark improved for consistency]:
A benchmark of the different answers, based on data.frames with 10 million rows (about 260 MB)...
library(microbenchmark)   # install.packages("microbenchmark")
library(data.table)
A <- data.frame(col1 = 1:1E7,
                col2 = rep(c("a string", "another string"), 1E7 / 2),
                col3 = 1:1E7,
                col4 = 1:1E7,
                col5 = rep(LETTERS[1:10], 1E6),
                stringsAsFactors = FALSE)
B <- A
B[1, 1] <- 100   # change one cell so that B becomes a real copy of A (copy-on-modify)
microbenchmark(DF.equals     = as.data.frame(A == B),
               DF.mult       = as.data.frame(1 * (A == B)),
               DF.map        = as.data.frame(Map(`==`, A, B)),
               matrix.equals = A == B,
               matrix.mult   = 1 * (A == B),
               matrix.map    = do.call(cbind, Map(`==`, A, B)),  # causes a warning: duplicated levels in factors are deprecated
               list.map      = Map(`==`, A, B),                  # fast because it builds only vectors, not a matrix
               times = 100)
shows the Map() function as the clear winner (on my system), being two to four times faster than the other variants, and shows that returning a matrix is much faster than returning a data.frame:
Unit: milliseconds
expr min lq mean median uq max neval cld
DF.equals 627.2541 630.7565 654.0266 635.1831 678.8903 686.0753 100 e
DF.mult 743.8531 751.7933 781.1876 796.2282 799.1881 848.2455 100 f
DF.map 169.6967 170.5842 176.5944 171.5072 173.5665 223.3354 100 a
matrix.equals 294.2570 297.5330 311.8095 299.8093 345.0827 351.9193 100 c
matrix.mult 402.6166 406.5279 422.9322 408.3012 453.4484 602.2139 100 d
matrix.map 206.2596 208.4230 217.8891 209.8968 211.4139 266.1867 100 b
list.map 169.1922 170.5403 175.7539 171.4602 173.3891 224.7062 100 a
BTW: What I really like is that you can now do some statistics, e.g. count the number of mismatches per column (or per row if you use rowSums instead):
colSums(C != TRUE)
or
colSums(A != B)
to get a result usable for automatically checking preconditions (e.g. no mismatches allowed):
mpg cyl disp hp drat wt qsec vs am gear carb
0 0 0 0 0 0 0 0 0 0 0
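Building on that, a minimal precondition check could look like this (a sketch only; the stopifnot() guard and the variable name are illustrative additions, not part of the answer above):
mismatches_per_column <- colSums(A != B)    # per-column mismatch counts, as above
stopifnot(all(mismatches_per_column == 0))  # abort if any column contains a mismatch
rowSums(A != B)                             # the analogous per-row mismatch count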
Upvotes: 9
Reputation: 1365
This example produces a TRUE/FALSE matrix, which can effectively be treated as 0/1 in R:
x = matrix(1:9, nrow = 3)
y = matrix(9:1, nrow = 3)
x == y
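If an explicit 0/1 matrix is wanted, simple arithmetic coerces the logical values; a small sketch using the x and y defined above (the + 0L step is an illustrative addition):
(x == y) + 0L   # integer 0/1 matrix
sum(x == y)     # number of matching cells; here 1, because only the centre element (5) matches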
Since there were a couple of other suggestions in the answers, I thought I'd test which was quickest, since speed was a requirement of the question. Here equals refers to the A == B solution and map_xy to the Map solution.
microbenchmark(equals(x,y), map_xy(x,y), times = 1000)
Unit: nanoseconds
expr min lq mean median uq max neval
equals(x, y) 360 399 468.491 459.0 508 3473 1000
map_xy(x, y) 10909 12114 13506.830 13132.5 14158 77743 1000
It looks like equals is the much faster option - but, as the Map answer indicates, it might fare better with larger datasets. So I tested again with reasonably sized data:
x_big = matrix(1:900000, nrow = 3)
y_big = matrix(900000:1, nrow = 3)
microbenchmark(equals(x_big, y_big), map_xy(x_big, y_big), times = 100)
Unit: milliseconds
                 expr        min          lq        mean      median          uq        max neval
 equals(x_big, y_big)   1.579069    2.118332    2.515257    2.225747    2.375377   21.48414   100
 map_xy(x_big, y_big) 846.172497 1040.383027 1165.354138 1147.239396 1321.166762 1489.81884   100
Which suggests that equals is still the faster option.
EDIT
In response to the comments, here is the code for each function. I have edited these slightly to convert the output to a data.frame (although I personally think this step is unnecessary).
library(magrittr)   # provides the %>% pipe used below
equals = function(x, y) {
  as.data.frame(x == y)
}
map_xy = function(x, y) {
  Map(`==`, x, y) %>%
    unlist() %>%
    matrix(nrow = 3) %>%   # nrow = 3 matches the test matrices above
    as.data.frame()
}
This changes the benchmark results but not the outcome:
For small matrices:
Unit: microseconds
expr min lq mean median uq max neval
equals(x, y) 18.090 20.3205 24.31075 22.0205 23.7285 781.048 1000
map_xy(x, y) 172.699 186.0775 209.39585 193.3645 204.0220 2646.419 1000
For large matrices:
Unit: milliseconds
                 expr       min        lq     mean    median        uq      max neval
 equals(x_big, y_big)  533.3274  646.0605  744.063  705.4923  871.3479 1067.411   100
 map_xy(x_big, y_big) 1637.2882 1820.8714 1938.458 1921.2563 2041.0533 2564.669   100
If you want the functions I originally used, just take out the code that converts the result into a data.frame.
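For reference, those matrix-returning versions would look roughly like this (a sketch reconstructed from the description above rather than the exact original code; the *_raw names are made up for illustration):
equals_raw = function(x, y) {
  x == y                       # logical matrix straight from the comparison
}
map_xy_raw = function(x, y) {
  Map(`==`, x, y) %>%          # element-wise comparisons as a list
    unlist() %>%
    matrix(nrow = 3)           # same hard-coded nrow as above
}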
Upvotes: 1
Reputation: 3689
Try:
C <- data.frame(1 * (A == B))
The 1 * is for turning TRUE/FALSE into 0/1, as required.
Upvotes: 5
Reputation: 887078
We can use Map to compare the corresponding columns of the two data.frames 'A' and 'B':
Map(`==`, A, B)
The advantage is that we get a list of logical vectors instead of a matrix in the workspace. If the datasets are really big, constructing the full matrix output could run into memory limits.
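If the 0/1 data.frame from the question is still needed, the list can be coerced at the end (a sketch; the as.integer step is an illustrative addition to this answer):
C <- as.data.frame(lapply(Map(`==`, A, B), as.integer))   # 0/1 columns, names taken from A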
Upvotes: 4