Reputation: 107
I have 2 dataframes.
df1:

col1  col2 col3 col4 col5
name1 A    23   x    y
name1 A    29   x    y
name1 B    17   x    y
name1 A    77   x    y
df2:

col1 col2 col3
B    17   LL1
Z    193  KK1
A    77   LO9
Y    80   LK2
I want to return the rows of df1 whose (col2, col3) pair does not match any (col1, col2) pair in df2.
The output should be:

col1  col2 col3 col4 col5
name1 A    23   x    y
name1 A    29   x    y
The solution I found:
unique.rows <- function(df1, df2) {
  out <- NULL
  for (i in 1:nrow(df1)) {
    found <- FALSE
    # compare this row's (col2, col3) against every (col1, col2) pair in df2
    for (j in 1:nrow(df2)) {
      if (all(df1[i,2:3] == df2[j,1:2])) {
        found <- TRUE
        break
      }
    }
    # keep the row only if no match was found
    if (!found) out <- rbind(out, df1[i,])
  }
  out
}
This solution works fine, but initially I was applying it to small dataframes. Now my df1 has about 10k rows and df2 has about 7 million rows, and it has been running for the last 2 days. Could anyone please suggest a faster way to do this?
Upvotes: 1
Views: 149
Reputation: 7475
Try:
> df1[!paste(df1$col2, df1$col3) %in% paste(df2$col1, df2$col2), ]
col1 col2 col3 col4 col5
1 name1 A 23 x y
2 name1 A 29 x y
Upvotes: 3
Reputation: 60924
What is probably biting you is the line:
if (!found) out <- rbind(out, df1[i,])
You continuously grow a data.frame, which causes the operating system to allocate new memory for the object. I would recommend you preallocate a data.frame with enough room and then assign the right output to the right index. This should speed things up by several orders of magnitude.
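For example, here is a minimal sketch of the preallocation idea, keeping your loop structure but recording which rows to keep in a pre-sized logical vector and subsetting df1 once at the end (the function name unique.rows.prealloc is just illustrative):

unique.rows.prealloc <- function(df1, df2) {
  # one preallocated flag per row of df1, instead of growing `out` with rbind()
  keep <- logical(nrow(df1))
  for (i in 1:nrow(df1)) {
    found <- FALSE
    for (j in 1:nrow(df2)) {
      if (all(df1[i,2:3] == df2[j,1:2])) {
        found <- TRUE
        break
      }
    }
    keep[i] <- !found   # assign into the preallocated slot
  }
  df1[keep, ]           # single subset at the end
}

Note that the inner loop over df2 is still the dominant cost for 7 million rows, so a vectorized approach such as the one in the other answer will be much faster overall.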
In addition, R is vectorized, so there is often no need for an explicit loop; see for example the answer by @ttmaccer. You could also take a look at data.table, which is lightning fast for these kinds of operations.
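For instance, a minimal data.table sketch of the same anti-join, assuming the column names shown in the question:

library(data.table)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# keep the rows of df1 whose (col2, col3) pair has no match
# among the (col1, col2) pairs of df2 (an anti-join)
result <- dt1[!dt2, on = c(col2 = "col1", col3 = "col2")]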
Upvotes: 2