Reputation: 43
Solution
I went with the solutions provided by @MauritsEvers and @akrun below.
Question
For a data frame, I want to keep only 1 column of each set of duplicate columns. In addition, the column that is kept takes on a name that is a concatenation of all column names in the set of duplicate columns. There are multiple sets of duplicate columns in the data frame. The data frame contains tens of thousands of columns, so using a for loop might take too much time.
I have tried a combination of using the duplicate(), summary(), aggregate(), lapply(), apply(), and using for loops.
Input data frame (df_in):
0 1 2 3 4 5 6 7
0 1 0 0 1 0 1 1
0 1 0 1 1 0 0 0
1 0 1 0 0 1 1 0
Output data frame (df_out):
0-2-5 1-4 3 6 7
0 1 0 1 1
0 1 1 0 0
1 0 0 1 0
Upvotes: 0
Views: 1243
Reputation: 887118
Here is an option with tidyverse
. We gather
the data into 'long' format, conver the 'value' into a string, grouped by 'value', paste
the 'key' column together, separate the rows of 'value' and then spread
the 'value' column to get the expected output
library(tidyverse)
gather(df_in) %>%
group_by(key) %>%
summarise(value = toString(value)) %>%
group_by(value) %>%
summarise(key = paste(key, collapse="-")) %>%
separate_rows(value) %>%
group_by(key) %>%
mutate(n = row_number()) %>%
spread(key, value) %>%
select(-n)
# A tibble: 3 x 5
# `0-2-5` `1-4` `3` `6` `7`
# <chr> <chr> <chr> <chr> <chr>
#1 0 1 0 1 1
#2 0 1 1 0 0
#3 1 0 0 1 0
Or another option with tidyverse
would be
t(df_in) %>%
as.data.frame %>%
mutate(grp = group_indices(., V1, V2, V3)) %>%
mutate(rn = row_number() - 1) %>%
group_split(grp, keep = FALSE) %>%
map_dfc(~ .x %>%
mutate(rn = str_c(rn, collapse="-")) %>%
slice(1) %>%
gather(key, val, -rn) %>%
rename(!! .$rn[1] := val) %>%
select(ncol(.)))
# A tibble: 3 x 5
# `0-2-5` `3` `7` `6` `1-4`
# <int> <int> <int> <int> <int>
#1 0 0 1 1 1
#2 0 1 0 0 1
#3 1 0 0 1 0
Or we can also do this with data.table
methods
library(data.table)
dcast(melt(as.data.table(t(df_in))[, grp := .GRP, .(V1, V2, V3)][,
c(.SD[1], cn = paste(.I-1, collapse="-")) , .(grp)],
id.var = c('cn', 'grp')), variable ~ cn, value.var = 'value')[,
variable := NULL][]
# 0-2-5 1-4 3 6 7
#1: 0 1 0 1 1
#2: 0 1 1 0 0
#3: 1 0 0 1 0
df_in <- structure(list(`0` = c(0L, 0L, 1L), `1` = c(1L, 1L, 0L), `2` = c(0L,
0L, 1L), `3` = c(0L, 1L, 0L), `4` = c(1L, 1L, 0L), `5` = c(0L,
0L, 1L), `6` = c(1L, 0L, 1L), `7` = c(1L, 0L, 0L)),
class = "data.frame", row.names = c(NA, -3L))
Upvotes: 3
Reputation: 50678
You can do the following in base R
Get indices of identical columns
idx <- split(seq_along(names(df)), apply(df, 2, paste, collapse = "_"))
Sort indices from low to high
idx <- idx[order(sapply(idx, function(x) x[1]))]
Names of idx
as concatentation of column names
names(idx) <- sapply(idx, function(x) paste(names(df)[x], collapse = "_"))
Create final matrix
sapply(idx, function(x) df[, x[1]])
# col0_col2_col5 col1_col4 col3_col6 col7
#[1,] 0 1 1 1
#[2,] 0 1 0 0
#[3,] 1 0 1 0
Note that the resulting object is a matrix
, so if you need a data.frame
simply cast as.data.frame
.
I've changed your sample data slightly to not have numbers as column names.
df <- read.table(text =
"col0 col1 col2 col3 col4 col5 col6 col7
0 1 0 1 1 0 1 1
0 1 0 0 1 0 0 0
1 0 1 1 0 1 1 0", header = T)
Upvotes: 1