How to collapse unique duplicate columns to unique columns in R?

Question

Solution

I went with the solutions provided by @MauritsEvers and @akrun below.

Question

For a data frame, I want to keep only one column from each set of duplicate columns. The column that is kept should take a name that is the concatenation of all column names in its set. There are multiple sets of duplicate columns in the data frame. The data frame contains tens of thousands of columns, so a for loop might take too much time.

I have tried combinations of duplicated(), summary(), aggregate(), lapply(), apply(), and for loops.

Input data frame (df_in):

0 1 2 3 4 5 6 7
0 1 0 0 1 0 1 1
0 1 0 1 1 0 0 0
1 0 1 0 0 1 1 0

Output data frame (df_out):

0-2-5 1-4 3 6 7
0     1   0 1 1
0     1   1 0 0
1     0   0 1 0

Maurits Evers · Accepted Answer

You can do the following in base R:

Get indices of identical columns

idx <- split(seq_along(names(df)), apply(df, 2, paste, collapse = "_"))

Sort indices from low to high

idx <- idx[order(sapply(idx, function(x) x[1]))]

Names of idx as concatenation of column names

names(idx) <- sapply(idx, function(x) paste(names(df)[x], collapse = "_"))

Create final matrix

sapply(idx, function(x) df[, x[1]])
#     col0_col2_col5 col1_col4 col3_col6 col7
#[1,]              0         1         1    1
#[2,]              0         1         0    0
#[3,]              1         0         1    0

Note that the resulting object is a matrix; if you need a data.frame, wrap the result in as.data.frame().
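As a minimal sketch of that conversion, the steps above can be run end to end on the answer's sample data, with the final matrix wrapped in as.data.frame(). The name res is only an illustrative choice, not part of the original answer.

```r
# Sample data from the answer
df <- read.table(text =
    "col0 col1 col2 col3 col4 col5 col6 col7
0 1 0 1 1 0 1 1
0 1 0 0 1 0 0 0
1 0 1 1 0 1 1 0", header = TRUE)

# Group column indices by the pasted contents of each column
idx <- split(seq_along(names(df)), apply(df, 2, paste, collapse = "_"))

# Order the groups by the position of their first member
idx <- idx[order(sapply(idx, function(x) x[1]))]

# Name each group after the concatenated names of its columns
names(idx) <- sapply(idx, function(x) paste(names(df)[x], collapse = "_"))

# Build the matrix from one representative per group, then convert it
# (res is a hypothetical name used for illustration)
res <- as.data.frame(sapply(idx, function(x) df[, x[1]]))
str(res)
```

One caveat: with a single-row data frame, sapply() would simplify to a vector rather than a matrix, so the conversion would need adjusting in that case.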

Sample data

I've changed your sample data slightly to not have numbers as column names.

df <- read.table(text =
    "col0 col1 col2 col3 col4 col5 col6 col7
0 1 0 1 1 0 1 1
0 1 0 0 1 0 0 0
1 0 1 1 0 1 1 0", header = TRUE)
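If the original numeric column names and the "-" separator from the question are wanted, one possible variant is sketched below. It assumes the data is read with check.names = FALSE so that read.table() does not rewrite the numeric names, and it subsets the data frame directly rather than building a matrix. The names df_in, df_out, key, and first are illustrative, and this is an adaptation of the approach above rather than the answer's exact code.

```r
# Question's input, keeping the numeric column names as they are
df_in <- read.table(text = "0 1 2 3 4 5 6 7
0 1 0 0 1 0 1 1
0 1 0 1 1 0 0 0
1 0 1 0 0 1 1 0", header = TRUE, check.names = FALSE)

# One string key per column; identical columns share a key
key <- apply(df_in, 2, paste, collapse = "_")

# Column positions grouped by key, ordered by first occurrence
idx <- split(seq_along(df_in), key)
idx <- idx[order(vapply(idx, function(x) x[1], integer(1)))]

# Keep the first column of each group; the result stays a data frame
first <- vapply(idx, function(x) x[1], integer(1), USE.NAMES = FALSE)
df_out <- df_in[first]

# Rename each kept column to the "-"-joined names of its group
names(df_out) <- vapply(idx, function(x) paste(names(df_in)[x], collapse = "-"),
                        character(1), USE.NAMES = FALSE)
print(df_out)
```

Because the key is built by pasting values with "_", columns holding strings that themselves contain "_" could in principle collide; for 0/1 data like the example this is not a concern.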
