Reputation: 69
I am an R amateur and learning slowly. I present the situation:
I have two data frames with several columns (4) and +10000 rows looking like this, both:
df1: df2:
Nº x y attr Nº x y attr
1 45 34 X 1 34 23 x
1 48 45 XX 4 123 45 x
1 41 23 X 4 99 69 xx
4 23 12 X 4 112 80 xx
4 28 16 X 5 78 80 x
5 78 80 XXX 5 69 74 xx
...
I would like to compare both data frames based on x,y (coordinates) to delete in df1 all the values that also appear in df2 (all the values/coordinates that are contained in both datasets, delete them in df1).
So in my example, the last row of df1 would be deleted because the same coordinates are in df2.
What I am doing is using a double loop for(), one for one dataset and another one for the other, to compare one by one all the values possible. I know this is extremely inefficient and it takes also a lot of time if I increase the amount of data.
What other ways are there to do this? There are probably some functions but I generally don't know how to use them so much and it gives me problems.
Thank you very much!!
Upvotes: 0
Views: 141
Reputation: 294
3 lines of code
#generate sample data
x1 <- sample(1:50,9001, T)
y1 <- sample(1:50,9001, T)
x2 <- sample(1:50,9001, T)
y2 <- sample(1:50,9001, T)
df1 <- data.frame(id =1:9001, x1,y1, stringsAsFactors = F)
df2 <- data.frame(id =1:9001, x2,y2, stringsAsFactors = F)
#add a match column to each dataframe
df1$match <- paste(df1$x1, df1$y1)
df2$match <- paste(df2$x2, df2$y2)
#overwrite df1 with the date of df1 that does not appear in df2
df1 <- df1[!df1$match %in% df2$match,]
Upvotes: 0
Reputation: 6227
Not the most elegant solution but gets the job done:
df2 = fread('Nº x y attr
1 34 23 x
4 123 45 x
4 99 69 xx
4 112 80 xx
5 78 80 x
5 69 74 xx')
df1 = fread('Nº x y attr
1 45 34 X
1 48 45 XX
1 41 23 X
4 23 12 X
4 28 16 X
5 78 80 XXX')
> df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),]
Nº x y attr
1: 1 45 34 X
2: 1 48 45 XX
3: 1 41 23 X
4: 4 23 12 X
5: 4 28 16 X
Explanation:
It's best to use vectorised functions rather than loops. !stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_")
concatenates the x and y columns into a string and then finds elements from df1 that aren't in df2. This creates a logical vector of TRUE FALSE values which we can then use to subset df1.
EDIT:
I was curious to see if mine or @dww answer was faster:
> library(microbenchmark)
>
> n=100000
>
> df1 = data.table(x = sample(n), y=sample(n))
> df2 = data.table(x = sample(n), y=sample(n))
>
>
>
> microbenchmark(
... df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),],
... df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
... )
Unit: milliseconds
expr
df1[!stringr::str_c(df1$x, df1$y, sep = "_") %in% stringr::str_c(df2$x, df2$y, sep = "_"), ]
df1[fsetdiff(df1[, .(x, y)], df2[, .(x, y)]), on = c("x", "y")]
min lq mean median uq max neval
168.40953 199.37183 219.30054 209.61414 222.08134 364.3458 100
41.07557 42.67679 52.34855 44.34379 59.27378 152.1283 100
Seems like the data.table version of dww is ~5x faster.
Upvotes: 1
Reputation: 31454
A library(data.table)
method:
df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
# Nº x y attr
#1: 1 45 34 X
#2: 1 48 45 XX
#3: 1 41 23 X
#4: 4 23 12 X
#5: 4 28 16 X
Upvotes: 1