Reputation: 163
I do not want to remove the duplicates, I want to record instances where a particular variable is duplicated and contains varying values for other columns. What I have works, but it's too slow.
So we have a data frame:
install.packages("data.table")
library(data.table)
df_test <- data.frame("ID" = c(1,1,1,2,3,4),
"Group1" = c("Red", "Blue", "Blue", "Red", "Yellow", "Green"),
"Group2" = c(2.5, 2.5, 3, 7, 5, 6),
"Group3" = c("X","X", "X", "Y", "Z", "X")
)
> df_test
ID Group1 Group2 Group3
1 1 Red 2.5 X
2 1 Blue 2.5 X
3 1 Blue 3.0 X
4 2 Red 7.0 Y
5 3 Yellow 5.0 Z
6 4 Green 6.0 X
We want to record all instances where there is varying information. In this case for ID = 1, we have varying information for both Group1 and Group2. So we want output as follows.
> repeated_df_test
ID Where_Repeated Values_Present
1: 1 Group1 Red, Blue
2: 1 Group2 2.5, 3
Right now this is how I have it. I want to make it more efficient. I'm not sure where to begin with that.
where_are_dupes_occuring <- function(d, idvar){
#d = df
#idvar = the id variable to use
nm <- deparse(substitute(d)) #name to use in assignment
dupes <- d[duplicated(d[[idvar]]), ] #find the duplicated IDs
dupes <- d[d[[idvar]] %in% dupes[[idvar]], ] #get all rows where there are duplicates
d <- dupes[!duplicated(dupes), ] #remove complete duplicates
where_repeated_f <- list()
for(id in levels(unique(as.factor(d[[idvar]])))){ #for each repeated id
dx <- d[d[[idvar]] == id, ] #get just the ID
where_repeated <- list()
for(c in colnames(dx)){ #for each column in data
uvals <- droplevels(unique(as.factor(dx[[c]]))) #get whats present
if(length(uvals) > 1){ #if theres more than 1 value present
values_present <- toString(uvals)
df <- data.frame("ID" = id, "Where_Repeated" = c, "Values_Present" = values_present) #record the repeated values
where_repeated[[c]] <- df
}
}
where_repeated_df <- rbindlist(where_repeated)
where_repeated_f[[id]] <- where_repeated_df
}
where_repeated_f_df <- rbindlist(where_repeated_f)
nmf <- paste0("repeated_",nm)
assign(nmf, where_repeated_f_df, envir = .GlobalEnv)
}
where_are_dupes_occuring(df_test, "ID")
Upvotes: 1
Views: 51
Reputation: 102241
Here is a data.table
option
subset(
melt(
setDT(df_test)[
,
lapply(.SD, function(x) list(unique(x))), ID
],
id.var = "ID",
value.name = "Values_Present",
variable.name = "Where_Repeated"
), lengths(Values_Present) > 1
)
which gives
ID Where_Repeated Values_Present
1: 1 Group1 Red, Blue
2: 1 Group2 2.5, 3
Upvotes: 1