Reputation: 866

How To Row Append Column With Same Column Name In A Data Frame

I have a dataset with thousands of columns with some of the column having identical column name. I want to merge column with same column name such that the values are appended as rows. And, for the columns that don't have column with same column name, 0 is appended in rows.

Clarification: Below is just an example, the real data set I have has thousands of column and many of those have column name that are duplicate and many aren't.

Sample Input Data

Col_1 Col_1 Col_1 Col_1 Col_2
  1     2     3     4   5
  5     6     7     8   5
  9    10    11    12   5
 13    14    15    16   5

Sample Output Data

Col_1 Col_2
  1    5
  2    5
  3    5
  4    5
  5    0
  6    0
  7    0 
  8    0
  9    0
 10    0
 11    0
 12    0
 13    0
 14    0
 15    0
 16    0

Upvotes: 1

Answers (3)

hmhensen

Reputation: 3195

Here's a pretty complicated answer. Some of the code is a bit clunky, but it is a general solution.

Solution

library(tidyverse)
library(magrittr)

# function to create lookup table, matching duplicate column names to syntactically valid names 
rel <- function(x) {x %>% 
  colnames %>% 
  make.names(., unique = TRUE) %>% 
  as.data.frame() %>% 
  mutate(names(x)) %>% 
  setNames(c("New", "Old")) }

# create lookup table to match old and new column names
lookup <- rel(df)

# gather df into long format
df_long <- df %>% 
  setNames(lookup$New) %>% 
  gather(var, value)

# match new names to original names
df_colnames <- lapply(1:length(unique(lookup$Old)), function(x) grepl(unique(lookup$Old)[x], df_long$var)) %>% 
  setNames(unique(lookup$Old)) %>% 
  as.data.frame

# vector replacing new syntactically valid names with original names
column <- lapply(names(df_colnames), function(x) ifelse(df_colnames[, x], x, F)) %>% 
  setNames(unique(lookup$Old)) %>% 
  as.data.frame %>% 
  unite(comb, sep = "") %>% 
  magrittr::extract(, "comb") %>% 
  gsub("FALSE", "", .)

# put original columns into lists
final_list <- df_long %>% 
  mutate(var = column) %>% 
  arrange(var, value) %>% 
  split(.$var) %>% 
  map(~select_at(.x, c("value"))) %>% 
  lapply(function(x) x$value)

# create vectors of zeros to append to original data
final_list_extend <- sapply(abs(unlist(lapply(final_list, length)) - max(unlist(lapply(final_list, length)))), function(x) rep(0, x))

# append zeros to original data and rename columns to match original names
output <- sapply(1:length(final_list), function(x) c(final_list[[x]], final_list_extend[[x]])) %>% 
  as_data_frame %>% 
  setNames(unique(lookup$Old))

#show result
output

# A tibble: 16 x 2
   Col_1 Col_2
   <dbl> <dbl>
 1     1     5
 2     2     5
 3     3     5
 4     4     5
 5     5     0
 6     6     0
 7     7     0
 8     8     0
 9     9     0
10    10     0
11    11     0
12    12     0
13    13     0
14    14     0
15    15     0
16    16     0

Data

df <- read.table(header = T, text = "
Col_1 Col_1 Col_1 Col_1 Col_2
  1     2     3     4   5
5     6     7     8   5
9    10    11    12   5
13    14    15    16   5") %>% 
  setNames(c("Col_1", "Col_1", "Col_1", "Col_1", "Col_2"))

Upvotes: 0

lkq

Reputation: 2366

Here is my way that involves some manual work. Let's assume your dataset is in the variable test

# may only require some of the packages of tidyverse
library(tidyverse)

# this will give all column unique names
renamed_test <- test %>%
                set_names(str_c(names(test), 1:ncol(test)))

# then for each duplicated column name, they now start with the same prefix;
# so select all these columns and use gather to append them one after another,
# and finally rename the merged column back to the original name
bound_col_1 <- renamed_test %>%
               select(starts_with("Col_1")) %>%
               gather %>%
               transmute(Col_1 = value)

# repeat this for 'Col_2'
# .....

# last, column bind all these results
bind_cols(bound_col_1, bound_col_2, [potentiall other variables])

Edit:

I generalized the solution so it will automatically find all duplicated columns and row bind each

library(tidyverse)

# testing data
test <- data.frame(c(1,2,3), c(7,8,9), c(4,5,6), c(10,11,12), c(100, 101, 102)) %>%
  set_names(c("Col_1", "Col_2", "Col_1", "Col_2", "Col_3"))

col_names <- names(test)

# find all columns that have duplicated columns
dup_names <- col_names[duplicated(col_names)]

# make the column names unique so it will work with tidyr
renamed_test <- test %>%
  set_names(str_c(col_names, "-", 1:ncol(test)))

unique_data <- test[!(duplicated(col_names) | duplicated(col_names, fromLast = TRUE))]

# for each duplicated column name, merge all columns that have the same name
dup_names %>% map(function(col_name) {
  renamed_test %>%
    select(starts_with(col_name)) %>% 
    gather %>% # bind rows
    select(-1) %>% # merged value is the last column
    set_names(c(col_name)) # rename the column name back to its original name
}) %>% bind_cols

result <- bind_rows(tmp_result, unique_data)

This is tricky when you try to bind the columns because the merged data might have different row number. You can compare the length every time when merging and fill the shorter list by appending 0s.

Upvotes: 1

NelsonGon

Reputation: 13309

Try this. The logic isn't clear: EDIT:: I think the best one can do is simply melt the data like this

library(tidyverse)
df1<-df %>% 
  gather("ID","Value") %>% 
  group_by(ID) %>% 
  arrange(Value)

df1$ID<-str_replace_all(df1$ID,"Col_1.\\d","Col_1")

You could proceed like this but I feel leaving the data melted is better.

library(reshape2)
df1 %>% 
  ungroup() %>% 
  dcast(Value~ID,fun=mean) %>% 
  mutate(Col_2=ifelse(Col_1<=4,5,0)) %>% 
  select(-Value)

Result(melted): The question then is how to deal with the duplicates.

 ID    Value
   <chr> <int>
 1 Col_1     1
 2 Col_1     2
 3 Col_1     3
 4 Col_1     4
 5 Col_1     5
 6 Col_2     5
 7 Col_2     5
 8 Col_2     5
 9 Col_2     5
10 Col_1     6
11 Col_1     7
12 Col_1     8
13 Col_1     9
14 Col_1    10
15 Col_1    11
16 Col_1    12
17 Col_1    13
18 Col_1    14
19 Col_1    15
20 Col_1    16

Original:

  library(tidyverse)
    df %>% 
  gather(key,value,-Col_2) %>% 
  arrange(value) %>% 
  rename(Col_1=value) %>% 
  mutate(Col_2=ifelse(Col_1<=4,5,0)) %>% 
  select(Col_1,everything(),-key)

Result:

      Col_1 Col_2
1      1     5
2      2     5
3      3     5
4      4     5
5      5     0
6      6     0
7      7     0
8      8     0
9      9     0
10    10     0
11    11     0
12    12     0
13    13     0
14    14     0
15    15     0
16    16     0