Reputation: 103
What is the fastest function in r to remove the rows in a dataframe if the same two first column is in another dataframe. For example, if the data frame A is as below(with more information columns):
NAME SURENAME
John Beer
Rose Pitt
Bob Kin
Charile Kind
Smith Red
Brad Tea
Kale Joe
Ana Bread
Lauren Old
Mike Karl
and B as below:
NAME SURENAME
Rose Pitt
Smith Red
Mike Karl
I want B to be removed from A to be like:
NAME SURENAME
John Beer
Bob Kin
Charile Kind
Brad Tea
Kale Joe
Ana Bread
Lauren Old
So in my case, A has 2 million rows (and 10 other columns) and B has 200,000 rows (all unique Name and Surnames).
Upvotes: 0
Views: 309
Reputation: 5956
Tested a benchmark filtering a data frame of 2 million rows by one with 200,000 rows, as indicated in the original post, where you can clearly see the speed of data.table
relative to dplyr
. Given the immense time dplyr
functions took to run, particularly set_diff
, I only ran each once.
rbenchmark::benchmark(
"dplyr_anti_join" = {
set.seed(1)
df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
b = runif(100000000, 1, 200000))
indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
b = 1:200000)
dplyr::anti_join(df, indices, by = c("a", "b"))
},
"dplyr_set_diff" = {
set.seed(1)
df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
b = runif(100000000, 1, 200000))
indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
b = 1:200000)
dplyr::setdiff(df, indices)
},
"dt" = {
set.seed(1)
library(data.table)
df <- data.table(a = letters[runif(10000000, min = 1, max = 26)],
b = runif(100000000, 1, 200000))
indices <- data.table(a = letters[runif(200000, min = 1, max = 26)],
b = 1:200000)
fsetdiff(df, indices)
},
replications = 1
)
#> test replications elapsed relative user.self sys.self user.child sys.child
#> 1 dplyr_anti_join 1 637.06 13.165 596.86 11.50 NA NA
#> 2 dplyr_set_diff 1 9981.93 206.281 320.67 4.66 NA NA
#> 3 dt 1 48.39 1.000 80.61 8.73 NA NA
Upvotes: 1
Reputation: 101189
Maybe you can try the code below using setdiff()
from dplyr
package, but you need to check its speed for large data frame (I am not sure about its performance then)
C <- dplyr::setdiff(A,B)
such that
> C
NAME SURENAME
1 John Beer
2 Bob Kin
3 Charile Kind
4 Brad Tea
5 Kale Joe
6 Ana Bread
7 Lauren Old
DATA
A <- structure(list(NAME = c("John", "Rose", "Bob", "Charile", "Smith",
"Brad", "Kale", "Ana", "Lauren", "Mike"), SURENAME = c("Beer",
"Pitt", "Kin", "Kind", "Red", "Tea", "Joe", "Bread", "Old", "Karl"
)), class = "data.frame", row.names = c(NA, -10L))
B <- structure(list(NAME = c("Rose", "Smith", "Mike"), SURENAME = c("Pitt",
"Red", "Karl")), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 1