Reputation: 721
I want to get rid of duplicates in a data frame by using the correct information stored in another data frame.
The problem is that the original data contains duplicated entries, some with the right values and some with wrong values. The right values are defined in another data frame, so I want to use that data frame as a reference for those rows.
So what I need is a conditional operation across the two data frames. To illustrate, let's say the original data is tree1:
tree1 = data.frame(
sp = c("oak","pine","apple","birch","oak","pine","apple","maple"),
code = c(23:26,77,88,99,27))
> tree1
sp code
1 oak 23
2 pine 24
3 apple 25
4 birch 26
5 oak 77
6 pine 88
7 apple 99
8 maple 27
And the reference data is tree2:
tree2 = data.frame( sp = c("oak","pine","apple"),
code = 23:25)
> tree2
sp code
1 oak 23
2 pine 24
3 apple 25
My desired output, with the wrongly coded duplicates removed and the rest of the original data kept, should look like this:
> tree3
sp code
1 oak 23
2 pine 24
3 apple 25
4 birch 26
5 maple 27
I know it looks like an easy conditional operation, but I ended up either deleting some original values or keeping the duplicates with wrong values (doing it the other way around does not work either). A simple base R solution would be great.
Upvotes: 4
Views: 73
Reputation: 2764
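You can also do something like this using the data.table package. Note that duplicated(sp) keeps the first row per species from tree1, so this relies on the correct row appearing before the wrong one, as it does in the example:
> library(data.table)
> setDT(tree2)[setDT(tree1),on=.(sp)][!(duplicated(sp)),.(sp,i.code)]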
sp i.code
1: oak 23
2: pine 24
3: apple 25
4: birch 26
5: maple 27
Upvotes: 1
Reputation: 39858
Also a dplyr possibility:
tree1 %>%
filter(code %in% tree2$code | !sp %in% tree2$sp)
sp code
1 oak 23
2 pine 24
3 apple 25
4 birch 26
5 maple 27
Or:
tree1 %>%
left_join(tree2, by = c("sp" = "sp")) %>%
filter(code.x == code.y | (!is.na(code.x) & is.na(code.y))) %>%
transmute(sp = sp,
code = code.x)
sp code
1 oak 23
2 pine 24
3 apple 25
4 birch 26
5 maple 27
Or the first possibility in base R:
subset(tree1, code %in% tree2$code | !sp %in% tree2$sp)
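One caveat on the line above: code %in% tree2$code checks the code column on its own, so a row such as oak with code 24 would be kept, because 24 is a valid code for pine. If that can happen in your data, a minimal sketch that compares species and code as a pair could look like this. The names key1 and key2 are made up for the illustration, and the "|" separator assumes that character never appears in sp.

```r
# Example data from the question.
tree1 <- data.frame(
  sp   = c("oak", "pine", "apple", "birch", "oak", "pine", "apple", "maple"),
  code = c(23:26, 77, 88, 99, 27),
  stringsAsFactors = FALSE
)
tree2 <- data.frame(
  sp   = c("oak", "pine", "apple"),
  code = 23:25,
  stringsAsFactors = FALSE
)

# Build one key per row so that species and code are compared as a pair.
# Assumption: "|" never occurs inside a species name.
key1 <- paste(tree1$sp, tree1$code, sep = "|")
key2 <- paste(tree2$sp, tree2$code, sep = "|")

# Keep a row if its (sp, code) pair is in the reference,
# or if the species has no reference entry at all.
tree3 <- tree1[key1 %in% key2 | !tree1$sp %in% tree2$sp, ]
rownames(tree3) <- NULL  # renumber the rows 1..n
tree3
```

On the example data this gives the same five rows as the subset() call, but it no longer depends on a wrong code happening to be absent from tree2$code.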
Upvotes: 1
Reputation: 12819
Drop the duplicated species altogether, since the corresponding correct values are in the second data frame anyway, and then row-bind that data frame:
rbind(
tree1[!(duplicated(tree1$sp) | duplicated(tree1$sp, fromLast = TRUE)), ],
tree2
)
#> sp code
#> 4 birch 26
#> 8 maple 27
#> 1 oak 23
#> 2 pine 24
#> 3 apple 25
Created on 2019-04-11 by the reprex package (v0.2.1)
Upvotes: 1
Reputation: 6768
Here is a dplyr option:
library(dplyr)
tree2bis <- filter(tree1, !(sp %in% tree2$sp)) # rows of tree1 whose species is not in tree2 (the non-duplicated ones)
tree1 %>% inner_join(tree2) %>% bind_rows(tree2bis)
# output
sp code
1 oak 23
2 pine 24
3 apple 25
4 birch 26
5 maple 27
Upvotes: 2
Reputation: 388982
One option using base R mapply. Assuming tree1 and tree2 have the same columns in the same order, we can check which values in tree1 are present in tree2, column by column, and select only those rows where all the values match or no values match.
vals <- rowSums(mapply(`%in%`, tree1, tree2))
tree1[vals == ncol(tree1) | vals == 0, ]
# sp code
#1 oak 23
#2 pine 24
#3 apple 25
#4 birch 26
#8 maple 27
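To see why the filter works, it can help to look at the intermediate result. The sketch below rebuilds the example data from the question and shows the per-row match counts; hits is a name introduced here only for illustration.

```r
# Example data from the question.
tree1 <- data.frame(
  sp   = c("oak", "pine", "apple", "birch", "oak", "pine", "apple", "maple"),
  code = c(23:26, 77, 88, 99, 27),
  stringsAsFactors = FALSE
)
tree2 <- data.frame(
  sp   = c("oak", "pine", "apple"),
  code = 23:25,
  stringsAsFactors = FALSE
)

# mapply() pairs the columns up by position: tree1$sp with tree2$sp,
# tree1$code with tree2$code. The result is a logical matrix with
# one row per row of tree1 and one column per shared column.
hits <- mapply(`%in%`, tree1, tree2)

# Number of columns in which each row of tree1 has a value found in tree2.
vals <- rowSums(hits)
vals
# 2 2 2 0 1 1 1 0

# 2 = both values are in the reference (keep),
# 0 = species and code are both unknown to the reference (keep),
# 1 = known species with an unknown code, i.e. a wrong duplicate (drop).
tree1[vals == ncol(tree1) | vals == 0, ]
```

Like the %in% filter in the other answers, this compares each column independently, so it assumes a wrong code never coincides with a valid code of another species.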
Upvotes: 3