Reputation: 3125
I have a dataframe like this:
| ID | VisitNum | Type |
|----|----------|-------------|
| 1 | 1 | Store |
| 1 | 1 | Online |
| 2 | 3 | Catalog |
| 2 | 3 | SmokeSignal |
| 3 | 100 | Walk-In |
I want to remove duplicate rows, but there is a pesky set of human-defined criteria for what to remove. Basically,
"If ID and VisitNum show up more than once, then
If the duplicate values for Type are Store and Online, remove the row with Online
OR
If the duplicate values for Type are Catalog and SmokeSignal, remove the row with SmokeSignal"
The output would be:
| ID | VisitNum | Type |
|----|----------|-------------|
| 1 | 1 | Store |
| 2 | 3 | Catalog |
| 3 | 100 | Walk-In |
I wanted to use dplyr and do some grouping and filtering... I'm thinking something like this:
dupes <- dupes %>%
group_by(ID, VisitNum) %>%
summarise(Count = n()) %>%
filter( <here's where I get stuck...some condition where the Count is greater than 1 then...> )
Any push in the right direction would be greatly appreciated.
And here is my dataframe:
dupes <- data.frame(ID=c(1, 1, 2, 2, 3), VisitNum=c(1, 1, 3, 3, 100), Type=c('Store', 'Online', 'Catalog', 'SmokeSignal', 'Walk-In'))
Upvotes: 1
Views: 635
Reputation: 887148
Based on the example, we can use slice to get the first element of each group, assuming that 'Type' is already in the preferred order:
library(dplyr)
dupes %>%
group_by(ID, VisitNum) %>%
slice(1)
# ID VisitNum Type
# <dbl> <dbl> <fctr>
#1 1 1 Store
#2 2 3 Catalog
#3 3 100 Walk-In
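The slice(1) approach above depends on the preferred row sorting first within each group. If that ordering isn't guaranteed, one sketch is to rank rows against an explicit preference vector before slicing; the c("Store", "Catalog") ranking below is my reading of the stated rules, not something from the question:

```r
library(dplyr)

# 'Online' deliberately listed before 'Store' to show the row order no longer matters
dupes <- data.frame(ID = c(1, 1, 2, 2, 3),
                    VisitNum = c(1, 1, 3, 3, 100),
                    Type = c("Online", "Store", "Catalog", "SmokeSignal", "Walk-In"))

deduped <- dupes %>%
  group_by(ID, VisitNum) %>%
  # match() ranks each row by preference; types not in the vector get NA and sort last
  arrange(match(Type, c("Store", "Catalog")), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()
deduped
```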
Or, filtering directly on the stated criteria:
dupes %>%
group_by(ID, VisitNum) %>%
filter((n() > 1 & !(Type %in% c("Online", "SmokeSignal"))) | n() == 1)
# ID VisitNum Type
# <dbl> <dbl> <fctr>
#1 1 1 Store
#2 2 3 Catalog
#3 3 100 Walk-In
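A more literal encoding of the pairwise rules is to drop a row only when its required partner is actually present in the same ID/VisitNum group (so a lone "Online" visit would survive). The remove_map lookup table is my own construction, not from the answer:

```r
library(dplyr)

dupes <- data.frame(ID = c(1, 1, 2, 2, 3),
                    VisitNum = c(1, 1, 3, 3, 100),
                    Type = c("Store", "Online", "Catalog", "SmokeSignal", "Walk-In"))

# type to drop -> partner that must be present in the same group for the drop to apply
remove_map <- c(Online = "Store", SmokeSignal = "Catalog")

cleaned <- dupes %>%
  group_by(ID, VisitNum) %>%
  # keep a row unless it is a droppable type AND its partner appears in the group
  filter(!(Type %in% names(remove_map) &
             remove_map[as.character(Type)] %in% Type)) %>%
  ungroup()
cleaned
```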
Upvotes: 1