Sean
Sean

Reputation: 103

The fastest way to remove rows from a very big dataframe matching another dataframe

What is the fastest function in r to remove the rows in a dataframe if the same two first column is in another dataframe. For example, if the data frame A is as below(with more information columns):

    NAME    SURENAME
    John    Beer
    Rose    Pitt
    Bob     Kin
    Charile Kind
    Smith   Red
    Brad    Tea
    Kale    Joe
    Ana     Bread
    Lauren  Old
    Mike    Karl 

and B as below:

NAME    SURENAME
Rose    Pitt
Smith   Red
Mike    Karl

I want B to be removed from A to be like:

    NAME    SURENAME
    John    Beer
    Bob     Kin
    Charile Kind
    Brad    Tea
    Kale    Joe
    Ana     Bread
    Lauren  Old

So in my case, A has 2 million rows (and 10 other columns) and B has 200,000 rows (all unique Name and Surnames).

Upvotes: 0

Views: 309

Answers (2)

caldwellst
caldwellst

Reputation: 5956

Tested a benchmark filtering a data frame of 2 million rows by one with 200,000 rows, as indicated in the original post, where you can clearly see the speed of data.table relative to dplyr. Given the immense time dplyr functions took to run, particularly set_diff, I only ran each once.

rbenchmark::benchmark(
  "dplyr_anti_join" = {
    set.seed(1)
    df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    dplyr::anti_join(df, indices, by = c("a", "b"))
  },
  "dplyr_set_diff" = {
    set.seed(1)
    df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    dplyr::setdiff(df, indices)
  },
  "dt" = {
    set.seed(1)
    library(data.table)
    df <- data.table(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.table(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    fsetdiff(df, indices)
  },
  replications = 1
)

#>              test replications elapsed relative user.self sys.self user.child sys.child
#> 1 dplyr_anti_join            1  637.06   13.165    596.86    11.50         NA        NA
#> 2  dplyr_set_diff            1 9981.93  206.281    320.67     4.66         NA        NA
#> 3              dt            1   48.39    1.000     80.61     8.73         NA        NA

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101189

Maybe you can try the code below using setdiff() from dplyr package, but you need to check its speed for large data frame (I am not sure about its performance then)

C <- dplyr::setdiff(A,B)

such that

> C
     NAME SURENAME
1    John     Beer
2     Bob      Kin
3 Charile     Kind
4    Brad      Tea
5    Kale      Joe
6     Ana    Bread
7  Lauren      Old

DATA

A <- structure(list(NAME = c("John", "Rose", "Bob", "Charile", "Smith", 
"Brad", "Kale", "Ana", "Lauren", "Mike"), SURENAME = c("Beer", 
"Pitt", "Kin", "Kind", "Red", "Tea", "Joe", "Bread", "Old", "Karl"
)), class = "data.frame", row.names = c(NA, -10L))

B <- structure(list(NAME = c("Rose", "Smith", "Mike"), SURENAME = c("Pitt", 
"Red", "Karl")), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 1

Related Questions