Reputation: 721

Cleaning the duplicates with a reference from another data frame

I want to remove duplicates from one data frame by using the correct values stored in another data frame.

The problem is that the original data contains duplicates, some with the right values and some with wrong ones. The right values are defined in another data frame, so I want to use that data frame as a reference for those rows.

So the filtering has to be conditional. To illustrate, let's say the original data is tree1:

tree1 = data.frame( 
sp = c("oak","pine","apple","birch","oak","pine","apple","maple"), 
code = c(23:26,77,88,99,27))
> tree1
     sp code
1   oak   23
2  pine   24
3 apple   25
4 birch   26
5   oak   77
6  pine   88
7 apple   99
8 maple   27

And the reference data is tree2:

tree2 = data.frame( sp = c("oak","pine","apple"),
                    code = 23:25)
> tree2
     sp code
1   oak   23
2  pine   24
3 apple   25

My desired output removes the duplicates with wrong values while keeping the rest of the original data. It should look like this:

> tree3
     sp code
1   oak   23
2  pine   24
3 apple   25
4 birch   26
5 maple   27

I know this looks like an easy conditional operation, but I ended up either deleting some original values or keeping the duplicates with wrong values. A simple base R solution would be great.

Upvotes: 4

Answers (5)

Rushabh Patel

Reputation: 2764

You can also do something like this using the data.table package:

> library(data.table)
> setDT(tree2)[setDT(tree1), on = .(sp)][!duplicated(sp), .(sp, i.code)]

     sp    i.code
1:   oak     23
2:  pine     24
3: apple     25
4: birch     26
5: maple     27

Upvotes: 1

tmfmnk

Reputation: 39858

Also a dplyr possibility:

library(dplyr)

tree1 %>%
  filter(code %in% tree2$code | !sp %in% tree2$sp)

     sp code
1   oak   23
2  pine   24
3 apple   25
4 birch   26
5 maple   27

Or:

tree1 %>%
 left_join(tree2, by = c("sp" = "sp")) %>%
 filter(code.x == code.y | (!is.na(code.x) & is.na(code.y))) %>%
 transmute(sp = sp,
           code = code.x)

     sp code
1   oak   23
2  pine   24
3 apple   25
4 birch   26
5 maple   27

Or the first possibility in base R:

subset(tree1, code %in% tree2$code | !sp %in% tree2$sp)
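
For reference, here is a self-contained base R sketch of the same idea, using the example data from the question. It differs slightly from the one-liner above: instead of checking whether a code appears anywhere in tree2, it looks up the reference code for each row's own species with match(). The names ref_code and keep are only illustrative, and the sketch assumes each species appears at most once in tree2.

```r
# Example data from the question
tree1 <- data.frame(
  sp   = c("oak", "pine", "apple", "birch", "oak", "pine", "apple", "maple"),
  code = c(23:26, 77, 88, 99, 27),
  stringsAsFactors = FALSE
)
tree2 <- data.frame(
  sp   = c("oak", "pine", "apple"),
  code = 23:25,
  stringsAsFactors = FALSE
)

# Reference code for each row's species (NA when the species is not in tree2)
ref_code <- tree2$code[match(tree1$sp, tree2$sp)]

# Keep rows that have no reference, or whose code equals the reference code
keep <- is.na(ref_code) | tree1$code == ref_code

tree3 <- tree1[keep, ]
rownames(tree3) <- NULL  # renumber rows 1..n
tree3
```

Because the comparison is done per species, a row such as oak with code 24 would be dropped here, whereas a plain `code %in% tree2$code` test would keep it.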

Upvotes: 1

Aurèle

Reputation: 12819

Drop every duplicated species altogether, since the corresponding correct values are in the second data frame anyway, and then row-bind that reference data frame:

rbind(
  tree1[!(duplicated(tree1$sp) | duplicated(tree1$sp, fromLast = TRUE)), ],
  tree2
)
#>      sp code
#> 4 birch   26
#> 8 maple   27
#> 1   oak   23
#> 2  pine   24
#> 3 apple   25

Created on 2019-04-11 by the reprex package (v0.2.1)

Upvotes: 1

nghauran

Reputation: 6768

Here is a dplyr option:

library(dplyr)
tree2bis <- filter(tree1, !(sp %in% tree2$sp)) # rows whose species has no reference in tree2
tree1 %>% inner_join(tree2, by = c("sp", "code")) %>% bind_rows(tree2bis)
# output
     sp code
1   oak   23
2  pine   24
3 apple   25
4 birch   26
5 maple   27

Upvotes: 2

Ronak Shah

Reputation: 388982

One option uses mapply from base R. Assuming tree1 and tree2 have the same columns in the same order, we can check which values in tree1 are present in tree2 and select only those rows where either all the values match or none do.

vals <- rowSums(mapply(`%in%`, tree1, tree2))
tree1[vals == ncol(tree1) | vals == 0, ]

#    sp  code
#1   oak   23
#2  pine   24
#3 apple   25
#4 birch   26
#8 maple   27
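
To make the intermediate step visible, the sketch below rebuilds the example data and stores the logical matrix before summing it. The name hits is only illustrative and is not part of the answer above.

```r
# Example data from the question
tree1 <- data.frame(
  sp   = c("oak", "pine", "apple", "birch", "oak", "pine", "apple", "maple"),
  code = c(23:26, 77, 88, 99, 27),
  stringsAsFactors = FALSE
)
tree2 <- data.frame(
  sp   = c("oak", "pine", "apple"),
  code = 23:25,
  stringsAsFactors = FALSE
)

# One logical column per variable: is this tree1 value present in the
# matching tree2 column?
hits <- mapply(`%in%`, tree1, tree2)

# Number of matching columns per row: 2 = full match, 0 = no match,
# 1 = species is in the reference but the code is not
vals <- rowSums(hits)

result <- tree1[vals == ncol(tree1) | vals == 0, ]
result
```

Rows with a count of 1 are the duplicates carrying a wrong code, which is why only counts of 0 and ncol(tree1) are kept. Note that each column is checked independently, so this relies on wrong codes never coinciding with some other species' reference code.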

Upvotes: 3
