Consolidate replicated columns in R

Question

I have a data frame that is like:

   c1 c2 c3 c4
r1  1  0  1  1
r2  0  0  1  1
r3  0  1  0  0

In this case, c3 and c4 are exactly the same. I would like to remove duplicate columns but keep the column names of both c3 and c4, so that the third column's name joins the names of the identical columns.

I feel like there should be an elegant way to do this that I just can't think of. Any help would be greatly appreciated!

Edit: Just to clarify, my actual data frames are 1000 rows x 1000 columns, and I don't know in advance which columns are identical. So I need an automatic way to test whether columns are identical and, where they are, to combine their column names.

Chris Townsend · Accepted Answer

The extra information adds an interesting wrinkle! If you don't care about concatenating the names of the columns you could do something like this:

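# Example data: c3/c4 and c5/c6 are duplicate pairs
df <- data.frame(c1 = c(1,0,0), c2 = c(0,0,1), c3 = c(1,1,0), c4 = c(1,1,0), c5 = c(1,1,1), c6 = c(1,1,1), c7 = c(2,2,2))

library(digest)
# Hash each column, then keep only the first column for each distinct hash
df_clean <- df[!duplicated(lapply(df, digest))]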

At this point df_clean would contain the data frame without any duplicates.
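
If you would rather not depend on the digest package, a minimal base R sketch of the same step is shown here. It relies on duplicated() accepting a list, so the columns are compared by value instead of by hash. The name df_clean_base is only illustrative and is not used elsewhere in this answer.

```r
# Same example data as above
df <- data.frame(c1 = c(1,0,0), c2 = c(0,0,1), c3 = c(1,1,0), c4 = c(1,1,0),
                 c5 = c(1,1,1), c6 = c(1,1,1), c7 = c(2,2,2))

# as.list() turns the data frame into a plain list of columns;
# duplicated() then flags every column that repeats an earlier one.
df_clean_base <- df[!duplicated(as.list(df))]

names(df_clean_base)
```

Hashing may still be the better choice for very wide data, since it compares short strings rather than whole columns.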

If the column names are genuinely important, this is how I would go about it after looking at thepule's answer:

df_dups <- df[duplicated(lapply(df, digest))]  # extract the duplicated columns

# Append each duplicate's name to the name of the matching kept column.
# seq_len() avoids the 1:0 trap when df_dups has no columns.
for (clean_col in seq_len(ncol(df_clean))) {
  for (dup_col in seq_len(ncol(df_dups))) {
    if (identical(df_clean[[clean_col]], df_dups[[dup_col]])) {
      colnames(df_clean)[clean_col] <- paste0(colnames(df_clean)[clean_col], colnames(df_dups)[dup_col])
    }
  }
}

The output of str(df_clean), with additional duplicates added for testing, looks like this:

'data.frame':   3 obs. of  5 variables:
 $ c1  : num  1 0 0
 $ c2  : num  0 0 1
 $ c3c4: num  1 1 0
 $ c5c6: num  1 1 1
 $ c7  : num  2 2 2
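
For 1000 columns the nested loop may get slow, so here is a rough single-pass sketch of the same idea in base R. The helper merge_identical_columns and its sep argument are hypothetical names made up for illustration. It assumes columns can be compared through their pasted string form, which is fine for 0/1 data like the example but could conflate values that merely print the same (for instance numeric 1 and character "1").

```r
# Hypothetical helper (not part of any package): drop duplicate columns
# and join the names of each group of identical columns.
merge_identical_columns <- function(df, sep = "") {
  # One string key per column; columns with equal values get equal keys.
  # Assumption: the pasted form identifies a column's content.
  keys <- vapply(df, function(col) paste(col, collapse = "\r"), character(1))
  # Group column names by key, in order of first appearance.
  groups <- split(names(df), factor(keys, levels = unique(keys)))
  # Keep the first column of each group, then rename with the joined names.
  out <- df[!duplicated(keys)]
  names(out) <- unname(vapply(groups, paste, character(1), collapse = sep))
  out
}

df <- data.frame(c1 = c(1,0,0), c2 = c(0,0,1), c3 = c(1,1,0), c4 = c(1,1,0),
                 c5 = c(1,1,1), c6 = c(1,1,1), c7 = c(2,2,2))
merged <- merge_identical_columns(df)
names(merged)
```

Passing sep = "_" or similar would make the joined names easier to split apart again later.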
