The fastest way to remove rows from a very big dataframe matching another dataframe

Question

What is the fastest function in r to remove the rows in a dataframe if the same two first column is in another dataframe. For example, if the data frame A is as below(with more information columns):

    NAME    SURENAME
    John    Beer
    Rose    Pitt
    Bob     Kin
    Charile Kind
    Smith   Red
    Brad    Tea
    Kale    Joe
    Ana     Bread
    Lauren  Old
    Mike    Karl

and B as below:

NAME    SURENAME
Rose    Pitt
Smith   Red
Mike    Karl

I want B to be removed from A to be like:

    NAME    SURENAME
    John    Beer
    Bob     Kin
    Charile Kind
    Brad    Tea
    Kale    Joe
    Ana     Bread
    Lauren  Old

So in my case, A has 2 million rows (and 10 other columns) and B has 200,000 rows (all unique Name and Surnames).

caldwellst · Accepted Answer

Tested a benchmark filtering a data frame of 2 million rows by one with 200,000 rows, as indicated in the original post, where you can clearly see the speed of data.table relative to dplyr. Given the immense time dplyr functions took to run, particularly set_diff, I only ran each once.

rbenchmark::benchmark(
  "dplyr_anti_join" = {
    set.seed(1)
    df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    dplyr::anti_join(df, indices, by = c("a", "b"))
  },
  "dplyr_set_diff" = {
    set.seed(1)
    df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    dplyr::setdiff(df, indices)
  },
  "dt" = {
    set.seed(1)
    library(data.table)
    df <- data.table(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.table(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    fsetdiff(df, indices)
  },
  replications = 1
)

#>              test replications elapsed relative user.self sys.self user.child sys.child
#> 1 dplyr_anti_join            1  637.06   13.165    596.86    11.50         NA        NA
#> 2  dplyr_set_diff            1 9981.93  206.281    320.67     4.66         NA        NA
#> 3              dt            1   48.39    1.000     80.61     8.73         NA        NA

The fastest way to remove rows from a very big dataframe matching another dataframe

Answers (2)

Related Questions