mmb_rach
mmb_rach

Reputation: 163

How to record varying values for duplicated IDs

I do not want to remove the duplicates, I want to record instances where a particular variable is duplicated and contains varying values for other columns. What I have works, but it's too slow.

So we have a data frame:

install.packages("data.table")
library(data.table)

df_test <- data.frame("ID" = c(1,1,1,2,3,4),
                 "Group1" = c("Red", "Blue", "Blue", "Red", "Yellow", "Green"),
                 "Group2" = c(2.5, 2.5, 3, 7, 5, 6),
                 "Group3" = c("X","X", "X", "Y", "Z", "X")
                 )

> df_test
  ID Group1 Group2 Group3
1  1    Red    2.5      X
2  1   Blue    2.5      X
3  1   Blue    3.0      X
4  2    Red    7.0      Y
5  3 Yellow    5.0      Z
6  4  Green    6.0      X

We want to record all instances where there is varying information. In this case for ID = 1, we have varying information for both Group1 and Group2. So we want output as follows.

> repeated_df_test
   ID Where_Repeated Values_Present
1:  1         Group1      Red, Blue
2:  1         Group2         2.5, 3

Right now this is how I have it. I want to make it more efficient. I'm not sure where to begin with that.

where_are_dupes_occuring <- function(d, idvar){
  #d = df
  #idvar = the id variable to use
  nm <- deparse(substitute(d)) #name to use in assignment
  
  dupes <- d[duplicated(d[[idvar]]), ] #find the duplicated IDs
  dupes <- d[d[[idvar]] %in% dupes[[idvar]], ] #get all rows where there are duplicates 
  d <- dupes[!duplicated(dupes), ] #remove complete duplicates 
  
  where_repeated_f <- list()
  for(id in levels(unique(as.factor(d[[idvar]])))){ #for each repeated id
    dx <- d[d[[idvar]] == id, ] #get just the ID
    where_repeated <- list()
    for(c in colnames(dx)){ #for each column in data
      uvals <- droplevels(unique(as.factor(dx[[c]]))) #get whats present
      if(length(uvals) > 1){ #if theres more than 1 value present
        values_present <- toString(uvals)
        df <- data.frame("ID" = id, "Where_Repeated" = c, "Values_Present" = values_present) #record the repeated values
        where_repeated[[c]] <- df
      }
    }
    where_repeated_df <- rbindlist(where_repeated)
    where_repeated_f[[id]] <- where_repeated_df
  }
  where_repeated_f_df <- rbindlist(where_repeated_f)
  nmf <- paste0("repeated_",nm)
  assign(nmf, where_repeated_f_df, envir = .GlobalEnv) 
}

where_are_dupes_occuring(df_test, "ID")

Upvotes: 1

Views: 51

Answers (1)

ThomasIsCoding
ThomasIsCoding

Reputation: 102241

Here is a data.table option

subset(
  melt(
    setDT(df_test)[
      ,
      lapply(.SD, function(x) list(unique(x))), ID
    ],
    id.var = "ID",
    value.name = "Values_Present",
    variable.name = "Where_Repeated"
  ), lengths(Values_Present) > 1
)

which gives

   ID Where_Repeated  Values_Present
1:  1          Group1      Red, Blue
2:  1          Group2         2.5, 3

Upvotes: 1

Related Questions