I have been using dplyr::all_equal to try and find differences between datasets. I don't always understand the output when the datasets are not equal.
I've generated some small tibbles so that the outputs can be checked by eye, but the different outputs still confuse me. I've looked at the documentation, but it doesn't explain how to read the result beyond the row positions it reports, and its examples don't really cover this case either.
library(tidyverse)
set.seed(123)
df1 <- as_tibble(rpois(4, 2))
df2 <- as_tibble(rpois(4, 2))
df3 <- as_tibble(rpois(4, 2))
df4 <- as_tibble(rpois(4, 2))
df1
#> # A tibble: 4 x 1
#> value
#> <int>
#> 1 1
#> 2 3
#> 3 2
#> 4 4
df2
#> # A tibble: 4 x 1
#> value
#> <int>
#> 1 4
#> 2 0
#> 3 2
#> 4 4
df3
#> # A tibble: 4 x 1
#> value
#> <int>
#> 1 2
#> 2 2
#> 3 5
#> 4 2
df4
#> # A tibble: 4 x 1
#> value
#> <int>
#> 1 3
#> 2 2
#> 3 0
#> 4 4
all_equal(df1, df2)
#> [1] "Rows in x but not y: 1, 2. Rows in y but not x: 2. Rows with difference occurences in x and y: 4"
all_equal(df1, df4)
#> [1] "Rows in x but not y: 1. Rows in y but not x: 3. "
all_equal(df1, df3)
#> [1] "Rows in x but not y: 1, 2, 4. Rows in y but not x: 3. Rows with difference occurences in x and y: 3"
all_equal(df2, df3)
#> [1] "Rows in x but not y: 2, 1. Rows in y but not x: 3. Rows with difference occurences in x and y: 3"
all_equal(df2, df4)
#> [1] "Rows in y but not x: 1. Rows with difference occurences in x and y: 1"
Created on 2019-06-26 by the reprex package (v0.2.1)
If someone were to ask me "How many observations are different between the two sets?", then based on the outputs above, my response would be the largest number of rows listed after any "Rows in __ but not __:" entry. So, for instance, I would say: "The number of observations that differ between df1 and df3 is 3."
Is this the right idea? Also, I don't know how to interpret the "Rows with difference occurences in x and y: number" portion: in all_equal(df1, df2), two observations differ between the sets, yet row 4 is reported even though its entries are the same in both.
Upvotes: 2
I recently had to do something similar for double data entry and used base R. It's not exactly what you asked for, but I hope it helps. This can be done more simply on a case-by-case basis (e.g., mapply(`==`, df1, df2)), but I tailored my answer to scale to many dataframes because you mention having 4. The code below tests each pair of dataframes, row by row, for equality. Keep in mind that this solution is order-dependent (unlike all_equal), and if your dataframes do not have the same columns and the same number of rows, you will need to adapt it before it is viable. Good luck!
library(tidyverse)
set.seed(123)
df1 <- as_tibble(rpois(4, 2))
df2 <- as_tibble(rpois(4, 2))
df3 <- as_tibble(rpois(4, 2))
df4 <- as_tibble(rpois(4, 2))
# Making a list of your dataframes
df_list <- mget(ls(pattern = "^df\\d+$"))
# Creating indices for the comparison (from df_list)
indices <- combn(seq_along(df_list), 2, simplify = FALSE)
# Comparing all elements of the df_list
comparisons <- lapply(indices, function(x) mapply(`==`, df_list[x[1]], df_list[x[2]]))
# Cleaning up names
names(comparisons) <- sapply(indices, paste, collapse = " vs ")
head(comparisons, 2)
$`1 vs 2`
df1
[1,] FALSE
[2,] FALSE
[3,] TRUE
[4,] TRUE
$`1 vs 3`
df1
[1,] FALSE
[2,] FALSE
[3,] FALSE
[4,] FALSE
# Now, summarise it however you like, e.g.: Pct agreement
sapply(comparisons, mean)
1 vs 2 1 vs 3 1 vs 4 2 vs 3 2 vs 4 3 vs 4
0.50 0.00 0.25 0.00 0.25 0.25
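If only one pair matters, the same positional check can be sketched without the list machinery. This is a minimal base R sketch, assuming both data frames have the same dimensions and column order. The values are copied from the question's df1 and df2, and the names same, row_agrees, n_different and pct_agree are only illustrative.

```r
# Values copied from the question's df1 and df2
df1 <- data.frame(value = c(1L, 3L, 2L, 4L))
df2 <- data.frame(value = c(4L, 0L, 2L, 4L))

# Element-wise comparison by position; this assumes identical
# dimensions and column order in both data frames
same <- df1 == df2                 # logical matrix, one column per variable

# A row agrees only if every one of its columns agrees
row_agrees <- apply(same, 1, all)

n_different <- sum(!row_agrees)    # number of rows that differ by position
pct_agree   <- mean(row_agrees)    # proportion of rows that agree

n_different
pct_agree
```

For this pair, n_different is 2 and pct_agree is 0.5, which lines up with the "1 vs 2" entry in the summary.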
EDIT: The solution above is similar to using all_equal(df1, df2, ignore_col_order = FALSE, ignore_row_order = FALSE) on each pair of dataframes.
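As for the order-independent messages in the question, one reading that is consistent with all five outputs shown there is that all_equal treats each data frame as a multiset of rows. A row is reported as "in x but not y" when its value never appears in y, and as having different occurrences when the value appears in both but a different number of times. The sketch below mimics that reading in base R for a single column. multiset_diff is a hypothetical helper, not a dplyr function, and I have not checked it against dplyr's internals. It reports the first row position of each distinct value, and the order of positions may differ from what all_equal prints.

```r
# Hypothetical helper (not part of dplyr): compares two vectors as
# multisets and reports first-occurrence row positions.
multiset_diff <- function(x, y) {
  first_x <- !duplicated(x)   # first occurrence of each value in x
  first_y <- !duplicated(y)   # first occurrence of each value in y

  # For every element of x, how often does its value occur in x and in y?
  n_in_x <- vapply(x, function(v) sum(x == v), integer(1))
  n_in_y <- vapply(x, function(v) sum(y == v), integer(1))

  list(
    in_x_not_y    = which(first_x & n_in_y == 0L),
    in_y_not_x    = which(first_y & !(y %in% x)),
    count_differs = which(first_x & n_in_y > 0L & n_in_x != n_in_y)
  )
}

# Values copied from the question's df1 and df2
x <- c(1L, 3L, 2L, 4L)
y <- c(4L, 0L, 2L, 4L)

res <- multiset_diff(x, y)
res$in_x_not_y     # 1 2  (values 1 and 3 never occur in y)
res$in_y_not_x     # 2    (value 0 never occurs in x)
res$count_differs  # 4    (value 4 occurs once in x, twice in y)
```

Under this reading, value 4 appears once in df1 (row 4) and twice in df2, which would explain why row 4 is flagged even though position 4 holds the same entry in both.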
Upvotes: 3