Reputation: 21

R - delete duplicate values based on multiple column keeping the row

My data set looks like this after merge()

id  ValueA  ValueB  ValueC  ValueD  ValueE  ValueF
1   page a  100     email   page a  300     Social
2   page b  130     social  page b  401     Email
3   page c  200     email   page c  234     Referral
4   page c  200     email   page c  345     Email
5   page c  200     email   page c  654     Social
6   page a  345     social  page d  237     Social
7   page e  200     social  page e  745     Email
8   page e  200     social  page e  675     Referral
9   page f  989     email   page f  123     social
10  page a  123     referralpage g  132     email

I want to delete duplicates values based on column "ValueA", "ValueB" and "ValueC" but keep rows 4, 5 and 8 because ValueD, VelueE and ValueF are still valid.

The expected output is

 id  ValueA  ValueB  ValueC  ValueD  ValueE  ValueF
1   page a  100     email   page a  300     Social
2   page b  130     social  page b  401     Email
3   page c  200     email   page c  234     Referral
4                           page c  345     Email
5                           page c  654     Social
6   page a  345     social  page d  237     Social
7   page e  200     social  page e  745     Email
8                           page e  675     Referral
9   page f  989     email   page f  123     social
10  page a  123     referralpage g  132     email

I tried using distinc()

df <- df %>% distinct(ValueA, ValueB, ValueC, .keep_all = T)

But it deletes the entire row

Upvotes: 1

Answers (4)

S. Ash

Reputation: 68

You can use dplyr to group by the columns with duplicate values you want to remove. Since you can't change them because you're grouping by them, you can create new columns without the duplicates.

test1<-test %>%
  group_by(ValueA, ValueB, ValueC) %>%
  mutate(ValueAA = ifelse(duplicated(ValueA), NA, ValueA),
         ValueBB = ifelse(duplicated(ValueB), NA, ValueB),
         ValueCC = ifelse(duplicated(ValueC), NA, ValueC)) %>%
  ungroup() %>%
  mutate(ValueA = ValueAA,
         ValueB = ValueBB,
         ValueC = ValueCC) %>%
  select(1:7)

The duplicate values are now replaced with NAs, but you can further replace the NAs with blanks.

Upvotes: 0

nsinghphd

Reputation: 2022

A non-tidyverse base R answer to your problem is

df[duplicated(df[, c('ValueA', 'ValueB', 'ValueC')]), 
   c('ValueA', 'ValueB', 'ValueC')] <- ""

Upvotes: 1

AntoniosK

Reputation: 16121

library(tidyverse)

# example data
dt = read.table(text = "
id  ValueA  ValueB  ValueC  ValueD  ValueE  ValueF
1   pagea  100     email   pagea  300     Social
2   pageb  130     social  pageb  401     Email
3   pagec  200     email   pagec  234     Referral
4   pagec  200     email   pagec  345     Email
5   pagec  200     email   pagec  654     Social
6   pagea  345     social  paged  237     Social
7   pagee  200     social  pagee  745     Email
8   pagee  200     social  pagee  675     Referral
9   pagef  989     email   pagef  123     social
10  pagea  123     referral pageg  132     email
", header=T, stringsAsFactors = F)

dt %>%
  group_by(ValueA, ValueB, ValueC) %>%    # for each combination of those variables
  mutate(flag = row_number()) %>%         # add the number of appearance (i.e. row number)
  ungroup() %>%                           # forget the grouping
  mutate_at(vars(ValueA, ValueB, ValueC), ~ifelse(flag > 1, "", .)) %>%  # update to empty cell if this is a duplicate row
  select(-flag) %>%                       # remove that column
  data.frame()                            # only for visualisation purpose

#    id ValueA ValueB   ValueC ValueD ValueE   ValueF
# 1   1  pagea    100    email  pagea    300   Social
# 2   2  pageb    130   social  pageb    401    Email
# 3   3  pagec    200    email  pagec    234 Referral
# 4   4                         pagec    345    Email
# 5   5                         pagec    654   Social
# 6   6  pagea    345   social  paged    237   Social
# 7   7  pagee    200   social  pagee    745    Email
# 8   8                         pagee    675 Referral
# 9   9  pagef    989    email  pagef    123   social
# 10 10  pagea    123 referral  pageg    132    email

Upvotes: 1

superboreen

Reputation: 57

Something here might help (in the Change Column Value Conditionally section). YMMV.

https://rstudio-pubs-static.s3.amazonaws.com/314427_a1a32bf219ea405c8728e35c72060f1a.html#change-column-value-conditionally.

Upvotes: 0

R - delete duplicate values based on multiple column keeping the row

Answers (4)

Related Questions