rlsrls

Reputation: 69

How to compare elements of two large datasets as efficiently as possible?

I am an R amateur and learning slowly. Here is the situation:

I have two data frames, each with four columns and more than 10,000 rows, both looking like this:

df1:                   df2:
Nº x   y   attr        Nº x   y   attr
1  45  34  X           1  34  23  x
1  48  45  XX          4  123 45  x
1  41  23  X           4  99  69  xx
4  23  12  X           4  112 80  xx
4  28  16  X           5  78  80  x
5  78  80  XXX         5  69  74  xx
...

I would like to compare both data frames on the x,y coordinate columns and delete from df1 every row whose coordinates also appear in df2 (i.e., any coordinate pair contained in both datasets should be removed from df1).

So in my example, the last row of df1 would be deleted because the same coordinates are in df2.

What I am doing now is a double for() loop, one over each dataset, comparing every possible pair of rows one by one. I know this is extremely inefficient, and it takes a long time as the amount of data grows.
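For reference, the double for() loop described above looks something like this (the data frames are reconstructed from the example; the Nº and attr columns are omitted). The runtime grows as nrow(df1) * nrow(df2), which is why it is slow on large inputs:

```r
# Example data from the question (x, y columns only)
df1 <- data.frame(x = c(45, 48, 41, 23, 28, 78),
                  y = c(34, 45, 23, 12, 16, 80))
df2 <- data.frame(x = c(34, 123, 99, 112, 78, 69),
                  y = c(23, 45, 69, 80, 80, 74))

# Naive O(n * m) double loop: mark rows of df1 whose (x, y)
# pair also occurs somewhere in df2
keep <- rep(TRUE, nrow(df1))
for (i in seq_len(nrow(df1))) {
  for (j in seq_len(nrow(df2))) {
    if (df1$x[i] == df2$x[j] && df1$y[i] == df2$y[j]) {
      keep[i] <- FALSE  # (x, y) also appears in df2 -> drop this row
      break
    }
  }
}
df1 <- df1[keep, ]  # the last row (78, 80) is removed
```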

What other ways are there to do this? There are probably functions for it, but I generally don't know how to use them and run into problems.

Thank you very much!!

Upvotes: 0

Views: 141

Answers (3)

3 lines of code

# generate sample data
set.seed(1)
x1 <- sample(1:50, 9001, replace = TRUE)
y1 <- sample(1:50, 9001, replace = TRUE)

x2 <- sample(1:50, 9001, replace = TRUE)
y2 <- sample(1:50, 9001, replace = TRUE)

df1 <- data.frame(id = 1:9001, x1, y1)
df2 <- data.frame(id = 1:9001, x2, y2)

# add a match key column to each data frame
df1$match <- paste(df1$x1, df1$y1)
df2$match <- paste(df2$x2, df2$y2)

# overwrite df1 with the rows of df1 whose key does not appear in df2
df1 <- df1[!df1$match %in% df2$match, ]
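The same anti-join can also be written with the dplyr package (not used elsewhere in this thread, so treat it as an aside): anti_join() keeps the rows of df1 that have no match in df2 on the given key columns, without building a key string by hand.

```r
library(dplyr)

# Example data from the question (x, y columns only)
df1 <- data.frame(x = c(45, 48, 41, 23, 28, 78),
                  y = c(34, 45, 23, 12, 16, 80))
df2 <- data.frame(x = c(34, 123, 99, 112, 78, 69),
                  y = c(23, 45, 69, 80, 80, 74))

# rows of df1 with no (x, y) match in df2
result <- anti_join(df1, df2, by = c("x", "y"))
```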

Upvotes: 0

user438383

Reputation: 6227

Not the most elegant solution but gets the job done:

library(data.table)

df2 = fread('Nº x   y   attr
1  34  23  x
4  123 45  x
4  99  69  xx
4  112 80  xx
5  78  80  x
5  69  74  xx')

df1 = fread('Nº x   y   attr
1  45  34  X
1  48  45  XX
1  41  23  X
4  23  12  X
4  28  16  X
5  78  80  XXX')
> df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),]
   Nº  x  y attr
1:  1 45 34    X
2:  1 48 45   XX
3:  1 41 23    X
4:  4 23 12    X
5:  4 28 16    X

Explanation:

It's best to use vectorised functions rather than loops. !stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_") concatenates the x and y columns into a single key string per row and then finds the rows of df1 whose key is not in df2. This creates a logical vector of TRUE/FALSE values, which we can then use to subset df1.

EDIT:

I was curious to see whether my answer or @dww's was faster:

> library(microbenchmark)
> 
> n=100000
> 
> df1 = data.table(x = sample(n), y=sample(n))
> df2 = data.table(x = sample(n), y=sample(n))
> 
> 
> 
> microbenchmark(
... df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),],
... df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
... )
Unit: milliseconds
                                                                                              expr
 df1[!stringr::str_c(df1$x, df1$y, sep = "_") %in% stringr::str_c(df2$x,      df2$y, sep = "_"), ]
                                   df1[fsetdiff(df1[, .(x, y)], df2[, .(x, y)]), on = c("x", "y")]
       min        lq      mean    median        uq      max neval
 168.40953 199.37183 219.30054 209.61414 222.08134 364.3458   100
  41.07557  42.67679  52.34855  44.34379  59.27378 152.1283   100

Seems like dww's data.table version is ~5x faster.

Upvotes: 1

dww

Reputation: 31454

A library(data.table) method:

df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
#   Nº  x  y attr
#1:  1 45 34    X
#2:  1 48 45   XX
#3:  1 41 23    X
#4:  4 23 12    X
#5:  4 28 16    X

Upvotes: 1
