Spencer A Lank
Spencer A Lank

Reputation: 43

How to collapse unique duplicate columns to unique columns in R?

Solution

I went with the solutions provided by @MauritsEvers and @akrun below.

Question

For a data frame, I want to keep only 1 column of each set of duplicate columns. In addition, the column that is kept takes on a name that is a concatenation of all column names in the set of duplicate columns. There are multiple sets of duplicate columns in the data frame. The data frame contains tens of thousands of columns, so using a for loop might take too much time.

I have tried a combination of using the duplicate(), summary(), aggregate(), lapply(), apply(), and using for loops.

Input data frame (df_in):

0 1 2 3 4 5 6 7
0 1 0 0 1 0 1 1
0 1 0 1 1 0 0 0
1 0 1 0 0 1 1 0

Output data frame (df_out):

0-2-5 1-4 3 6 7
0     1   0 1 1
0     1   1 0 0
1     0   0 1 0

Upvotes: 0

Views: 1243

Answers (2)

akrun
akrun

Reputation: 887118

Here is an option with tidyverse. We gather the data into 'long' format, conver the 'value' into a string, grouped by 'value', paste the 'key' column together, separate the rows of 'value' and then spread the 'value' column to get the expected output

library(tidyverse)
gather(df_in) %>% 
   group_by(key) %>%
   summarise(value = toString(value)) %>% 
   group_by(value) %>% 
   summarise(key = paste(key, collapse="-")) %>% 
   separate_rows(value) %>% 
   group_by(key) %>%
   mutate(n = row_number()) %>% 
   spread(key, value) %>%
   select(-n)
# A tibble: 3 x 5
#  `0-2-5` `1-4` `3`   `6`   `7`  
#  <chr>   <chr> <chr> <chr> <chr>
#1 0       1     0     1     1    
#2 0       1     1     0     0    
#3 1       0     0     1     0    

Or another option with tidyverse would be

t(df_in) %>%
    as.data.frame %>%
    mutate(grp = group_indices(., V1, V2, V3)) %>%
    mutate(rn = row_number() - 1) %>% 
    group_split(grp, keep = FALSE) %>% 
    map_dfc(~ .x %>% 
           mutate(rn = str_c(rn, collapse="-")) %>% 
           slice(1)  %>% 
           gather(key, val, -rn) %>% 
           rename(!! .$rn[1] := val) %>% 
           select(ncol(.)))
# A tibble: 3 x 5
#  `0-2-5`   `3`   `7`   `6` `1-4`
#    <int> <int> <int> <int> <int>
#1       0     0     1     1     1
#2       0     1     0     0     1
#3       1     0     0     1     0

Or we can also do this with data.table methods

library(data.table)
dcast(melt(as.data.table(t(df_in))[, grp := .GRP, .(V1, V2, V3)][, 
     c(.SD[1], cn = paste(.I-1, collapse="-")) , .(grp)],
      id.var = c('cn', 'grp')), variable ~ cn, value.var = 'value')[, 
        variable := NULL][]
#    0-2-5 1-4 3 6 7
#1:     0   1 0 1 1
#2:     0   1 1 0 0
#3:     1   0 0 1 0

data

df_in <- structure(list(`0` = c(0L, 0L, 1L), `1` = c(1L, 1L, 0L), `2` = c(0L, 
 0L, 1L), `3` = c(0L, 1L, 0L), `4` = c(1L, 1L, 0L), `5` = c(0L, 
 0L, 1L), `6` = c(1L, 0L, 1L), `7` = c(1L, 0L, 0L)),
  class = "data.frame", row.names = c(NA, -3L))

Upvotes: 3

Maurits Evers
Maurits Evers

Reputation: 50678

You can do the following in base R

  1. Get indices of identical columns

    idx <- split(seq_along(names(df)), apply(df, 2, paste, collapse = "_"))
    
  2. Sort indices from low to high

    idx <- idx[order(sapply(idx, function(x) x[1]))]
    
  3. Names of idx as concatentation of column names

    names(idx) <- sapply(idx, function(x) paste(names(df)[x], collapse = "_"))
    
  4. Create final matrix

    sapply(idx, function(x) df[, x[1]])
    #     col0_col2_col5 col1_col4 col3_col6 col7
    #[1,]              0         1         1    1
    #[2,]              0         1         0    0
    #[3,]              1         0         1    0
    

Note that the resulting object is a matrix, so if you need a data.frame simply cast as.data.frame.


Sample data

I've changed your sample data slightly to not have numbers as column names.

df <- read.table(text =
    "col0 col1 col2 col3 col4 col5 col6 col7
0 1 0 1 1 0 1 1
0 1 0 0 1 0 0 0
1 0 1 1 0 1 1 0", header = T)

Upvotes: 1

Related Questions