Rajiv
Rajiv

Reputation: 107

Excluding rows if present in second dataframe in R

I have 2 dataframes.

df1-

col1   col2   col3    col4    col5
name1    A     23      x       y
name1    A     29      x       y
name1    B     17      x       y
name1    A     77      x       y

df2-

   
col1  col2  col3
B     17      LL1
Z     193     KK1       
A     77      LO9
Y     80      LK2       

I want to return those rows from df1 if col2 and col3 of df1 are not equal to col1 and col2 of df2.

The output should be-

col1   col2   col3    col4    col5
name1    A     23      x       y
name1    A     29      x       y

Solution I found-

unique.rows <- function (df1, df2) {   
  out <- NULL
  for (i in 1:nrow(df1)) {
    found <- FALSE
    for (j in 1:nrow(df2)) {
      if (all(df1[i,2:3] == df2[j,1:2])) {
        found <- TRUE
        break
      }
    }
    if (!found) out <- rbind(out, df1[i,])
  }
  out
}

This solution is working fine but initially, I was applying for small dataframes. Now my df1 has about 10k rows and df2 has about 7 million rows. It is just running and running from last 2 days. Could anyone please suggest a fast way to do this?

Upvotes: 1

Views: 149

Answers (2)

shhhhimhuntingrabbits
shhhhimhuntingrabbits

Reputation: 7475

try

> df1[!paste(df1$col2,df1$col3)%in%paste(df2$col1,df2$col2),]
   col1 col2 col3 col4 col5
1 name1    A   23    x    y
2 name1    A   29    x    y

Upvotes: 3

Paul Hiemstra
Paul Hiemstra

Reputation: 60924

What is probably biting you is the line:

if (!found) out <- rbind(out, df1[i,])

You continuously grow a data.frame, which causes the operating system to allocate new memory for the object. I would recommend you preallocate a data.frame with enough room and then assign the right output to the right index. This should speed things up by several orders of magnitude.

In addition, R works vectorized so often there is no need for an explicit loop. See for example the answer by @ttmaccer. You could also take a look at data.table, which is lightning fast for these kinds of operations.

Upvotes: 2

Related Questions