Reputation: 45
I have a data frame like this:
   c1 c2 c3 c4
r1  1  0  1  1
r2  0  0  1  1
r3  0  1  0  0
In this case, c3 and c4 are exactly the same. I would like to remove duplicate columns but keep the column names of both c3 and c4, to get this data frame:
   c1 c2 c3c4
r1  1  0    1
r2  0  0    1
r3  0  1    0
where the third column name joins the column names of the identical columns.
I feel like there should be an elegant way to do this that I just can't think of. Any help would be greatly appreciated!
Edit: Just to clarify, my actual data frames are 1000 rows x 1000 columns, and I don't know in advance which columns are identical. So I need an automatic way to test whether columns are identical and, where they are, to combine their column names.
Upvotes: 4
Views: 92
Reputation: 3162
The extra information adds an interesting wrinkle! If you don't care about concatenating the column names, you could do something like this:
df <- data.frame(c1 = c(1, 0, 0), c2 = c(0, 0, 1), c3 = c(1, 1, 0), c4 = c(1, 1, 0), c5 = c(1, 1, 1), c6 = c(1, 1, 1), c7 = c(2, 2, 2))
library(digest)
df_clean <- df[!duplicated(lapply(df, digest))]
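To see why this keeps the first copy of each column and drops only the repeats, it may help to look at the logical vector that duplicated() returns. The following is a small sketch that assumes base R's duplicated() applied directly to the list of columns (no hashing), using the same test data; is_repeat is a name introduced here for illustration.

```r
df <- data.frame(c1 = c(1, 0, 0), c2 = c(0, 0, 1), c3 = c(1, 1, 0),
                 c4 = c(1, 1, 0), c5 = c(1, 1, 1), c6 = c(1, 1, 1),
                 c7 = c(2, 2, 2))

# duplicated() marks an element TRUE only if an equal element occurred earlier,
# so the first column of each identical group stays FALSE.
is_repeat <- duplicated(as.list(df))

names(df)[is_repeat]   # columns that would be dropped: "c4" "c6"
names(df)[!is_repeat]  # columns that would be kept: "c1" "c2" "c3" "c5" "c7"
```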
If the column names are genuinely important, this is how I would go about it after looking at thepule's answer:
df_dups <- df[duplicated(lapply(df, digest))]  # extract the duplicates
# seq_len() gives an empty loop when there are no columns, unlike 1:ncol()
for (clean_col in seq_len(ncol(df_clean))) {
  for (dup_col in seq_len(ncol(df_dups))) {
    if (identical(df_clean[[clean_col]], df_dups[[dup_col]])) {
      # append the duplicate's name to the name of the column that was kept
      colnames(df_clean)[clean_col] <- paste0(colnames(df_clean)[clean_col], colnames(df_dups)[dup_col])
    }
  }
}
The output of str(df_clean), with additional duplicate columns added for testing, looks like this:
'data.frame': 3 obs. of 5 variables:
$ c1 : num 1 0 0
$ c2 : num 0 0 1
$ c3c4: num 1 1 0
$ c5c6: num 1 1 1
$ c7 : num 2 2 2
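For a 1000 x 1000 data frame, the nested loop compares every kept column with every dropped one. A possible loop-free variant is sketched below; it is not the method above but a swapped-in technique that groups column names by a string key built from each column's values. The names key, groups, and merged are introduced here for illustration. It assumes that columns whose values print identically should count as identical, so an integer 1 and a numeric 1 would be merged, unlike with identical().

```r
df <- data.frame(c1 = c(1, 0, 0), c2 = c(0, 0, 1), c3 = c(1, 1, 0),
                 c4 = c(1, 1, 0), c5 = c(1, 1, 1), c6 = c(1, 1, 1),
                 c7 = c(2, 2, 2))

# One string per column; "\r" is an assumed separator unlikely to occur in the data
key <- vapply(df, function(col) paste(col, collapse = "\r"), character(1))

# Group the column names by key, keeping groups in order of first appearance
groups <- split(names(df), factor(key, levels = unique(key)))

# Keep the first column of each group and rename it with the joined names
merged <- df[!duplicated(key)]
names(merged) <- vapply(groups, paste, character(1), collapse = "", USE.NAMES = FALSE)

names(merged)  # "c1" "c2" "c3c4" "c5c6" "c7"
```

Each column is visited once to build its key, so the work grows with the number of columns rather than with the number of column pairs.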
Upvotes: 2
Reputation: 1751
It may not be a super elegant solution, but it gets the job done.
If df is your data frame:
dups <- duplicated(as.list(df))  # TRUE for each column that repeats an earlier one
df_clean <- df[!dups]
df_dups <- df[dups]
for (z in seq_len(ncol(df_clean))) {
  i <- names(df_clean)[z]
  # indices of the dropped columns that are identical to this kept column
  d <- which(vapply(df_dups, function(x) identical(x, df_clean[[z]]), logical(1)))
  # append their names; nothing is appended when d is empty
  names(df_clean)[z] <- paste0(i, paste(names(df_dups)[d], collapse = ""))
}
The output is:
df_clean
   c1 c2 c3c4
r1  1  0    1
r2  0  0    1
r3  0  1    0
This should also work if a column has more than one duplicate.
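As a quick check of the multiple-duplicates case, here is a sketch that wraps the same idea in a function and runs it on a column with two copies. The function name merge_identical_columns and the test frame df3 are hypothetical, introduced only for this example.

```r
# Hypothetical helper: drop repeated columns and join their names onto the kept one
merge_identical_columns <- function(df) {
  dups <- duplicated(as.list(df))
  df_clean <- df[!dups]
  df_dups <- df[dups]
  for (z in seq_len(ncol(df_clean))) {
    # which dropped columns are identical to the z-th kept column
    same <- vapply(df_dups, function(x) identical(x, df_clean[[z]]), logical(1))
    names(df_clean)[z] <- paste0(names(df_clean)[z],
                                 paste(names(df_dups)[same], collapse = ""))
  }
  df_clean
}

# Columns a, b and d are identical; c is unique
df3 <- data.frame(a = c(1, 0), b = c(1, 0), c = c(0, 1), d = c(1, 0))
result <- merge_identical_columns(df3)
names(result)  # "abd" "c"
```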
Upvotes: 1