Reputation: 21
My data set looks like this after merge()
id ValueA ValueB ValueC ValueD ValueE ValueF
1 page a 100 email page a 300 Social
2 page b 130 social page b 401 Email
3 page c 200 email page c 234 Referral
4 page c 200 email page c 345 Email
5 page c 200 email page c 654 Social
6 page a 345 social page d 237 Social
7 page e 200 social page e 745 Email
8 page e 200 social page e 675 Referral
9 page f 989 email page f 123 social
10 page a 123 referralpage g 132 email
I want to delete duplicates values based on column "ValueA", "ValueB" and "ValueC" but keep rows 4, 5 and 8 because ValueD, VelueE and ValueF are still valid.
The expected output is
id ValueA ValueB ValueC ValueD ValueE ValueF
1 page a 100 email page a 300 Social
2 page b 130 social page b 401 Email
3 page c 200 email page c 234 Referral
4 page c 345 Email
5 page c 654 Social
6 page a 345 social page d 237 Social
7 page e 200 social page e 745 Email
8 page e 675 Referral
9 page f 989 email page f 123 social
10 page a 123 referralpage g 132 email
I tried using distinc()
df <- df %>% distinct(ValueA, ValueB, ValueC, .keep_all = T)
But it deletes the entire row
Upvotes: 1
Views: 548
Reputation: 68
You can use dplyr to group by the columns with duplicate values you want to remove. Since you can't change them because you're grouping by them, you can create new columns without the duplicates.
test1<-test %>%
group_by(ValueA, ValueB, ValueC) %>%
mutate(ValueAA = ifelse(duplicated(ValueA), NA, ValueA),
ValueBB = ifelse(duplicated(ValueB), NA, ValueB),
ValueCC = ifelse(duplicated(ValueC), NA, ValueC)) %>%
ungroup() %>%
mutate(ValueA = ValueAA,
ValueB = ValueBB,
ValueC = ValueCC) %>%
select(1:7)
The duplicate values are now replaced with NAs, but you can further replace the NAs with blanks.
Upvotes: 0
Reputation: 2022
A non-tidyverse
base R
answer to your problem is
df[duplicated(df[, c('ValueA', 'ValueB', 'ValueC')]),
c('ValueA', 'ValueB', 'ValueC')] <- ""
Upvotes: 1
Reputation: 16121
library(tidyverse)
# example data
dt = read.table(text = "
id ValueA ValueB ValueC ValueD ValueE ValueF
1 pagea 100 email pagea 300 Social
2 pageb 130 social pageb 401 Email
3 pagec 200 email pagec 234 Referral
4 pagec 200 email pagec 345 Email
5 pagec 200 email pagec 654 Social
6 pagea 345 social paged 237 Social
7 pagee 200 social pagee 745 Email
8 pagee 200 social pagee 675 Referral
9 pagef 989 email pagef 123 social
10 pagea 123 referral pageg 132 email
", header=T, stringsAsFactors = F)
dt %>%
group_by(ValueA, ValueB, ValueC) %>% # for each combination of those variables
mutate(flag = row_number()) %>% # add the number of appearance (i.e. row number)
ungroup() %>% # forget the grouping
mutate_at(vars(ValueA, ValueB, ValueC), ~ifelse(flag > 1, "", .)) %>% # update to empty cell if this is a duplicate row
select(-flag) %>% # remove that column
data.frame() # only for visualisation purpose
# id ValueA ValueB ValueC ValueD ValueE ValueF
# 1 1 pagea 100 email pagea 300 Social
# 2 2 pageb 130 social pageb 401 Email
# 3 3 pagec 200 email pagec 234 Referral
# 4 4 pagec 345 Email
# 5 5 pagec 654 Social
# 6 6 pagea 345 social paged 237 Social
# 7 7 pagee 200 social pagee 745 Email
# 8 8 pagee 675 Referral
# 9 9 pagef 989 email pagef 123 social
# 10 10 pagea 123 referral pageg 132 email
Upvotes: 1
Reputation: 57
Something here might help (in the Change Column Value Conditionally section). YMMV.
Upvotes: 0