Reputation: 866
I have a dataset with thousands of columns with some of the column having identical column name. I want to merge column with same column name such that the values are appended as rows. And, for the columns that don't have column with same column name, 0
is appended in rows.
Clarification: Below is just an example, the real data set I have has thousands of column and many of those have column name that are duplicate and many aren't.
Sample Input Data
Col_1 Col_1 Col_1 Col_1 Col_2
1 2 3 4 5
5 6 7 8 5
9 10 11 12 5
13 14 15 16 5
Sample Output Data
Col_1 Col_2
1 5
2 5
3 5
4 5
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
Upvotes: 1
Views: 1866
Reputation: 3195
Here's a pretty complicated answer. Some of the code is a bit clunky, but it is a general solution.
library(tidyverse)
library(magrittr)
# function to create lookup table, matching duplicate column names to syntactically valid names
rel <- function(x) {x %>%
colnames %>%
make.names(., unique = TRUE) %>%
as.data.frame() %>%
mutate(names(x)) %>%
setNames(c("New", "Old")) }
# create lookup table to match old and new column names
lookup <- rel(df)
# gather df into long format
df_long <- df %>%
setNames(lookup$New) %>%
gather(var, value)
# match new names to original names
df_colnames <- lapply(1:length(unique(lookup$Old)), function(x) grepl(unique(lookup$Old)[x], df_long$var)) %>%
setNames(unique(lookup$Old)) %>%
as.data.frame
# vector replacing new syntactically valid names with original names
column <- lapply(names(df_colnames), function(x) ifelse(df_colnames[, x], x, F)) %>%
setNames(unique(lookup$Old)) %>%
as.data.frame %>%
unite(comb, sep = "") %>%
magrittr::extract(, "comb") %>%
gsub("FALSE", "", .)
# put original columns into lists
final_list <- df_long %>%
mutate(var = column) %>%
arrange(var, value) %>%
split(.$var) %>%
map(~select_at(.x, c("value"))) %>%
lapply(function(x) x$value)
# create vectors of zeros to append to original data
final_list_extend <- sapply(abs(unlist(lapply(final_list, length)) - max(unlist(lapply(final_list, length)))), function(x) rep(0, x))
# append zeros to original data and rename columns to match original names
output <- sapply(1:length(final_list), function(x) c(final_list[[x]], final_list_extend[[x]])) %>%
as_data_frame %>%
setNames(unique(lookup$Old))
#show result
output
# A tibble: 16 x 2
Col_1 Col_2
<dbl> <dbl>
1 1 5
2 2 5
3 3 5
4 4 5
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 0
14 14 0
15 15 0
16 16 0
df <- read.table(header = T, text = "
Col_1 Col_1 Col_1 Col_1 Col_2
1 2 3 4 5
5 6 7 8 5
9 10 11 12 5
13 14 15 16 5") %>%
setNames(c("Col_1", "Col_1", "Col_1", "Col_1", "Col_2"))
Upvotes: 0
Reputation: 2366
Here is my way that involves some manual work. Let's assume your dataset is in the variable test
# may only require some of the packages of tidyverse
library(tidyverse)
# this will give all column unique names
renamed_test <- test %>%
set_names(str_c(names(test), 1:ncol(test)))
# then for each duplicated column name, they now start with the same prefix;
# so select all these columns and use gather to append them one after another,
# and finally rename the merged column back to the original name
bound_col_1 <- renamed_test %>%
select(starts_with("Col_1")) %>%
gather %>%
transmute(Col_1 = value)
# repeat this for 'Col_2'
# .....
# last, column bind all these results
bind_cols(bound_col_1, bound_col_2, [potentiall other variables])
I generalized the solution so it will automatically find all duplicated columns and row bind each
library(tidyverse)
# testing data
test <- data.frame(c(1,2,3), c(7,8,9), c(4,5,6), c(10,11,12), c(100, 101, 102)) %>%
set_names(c("Col_1", "Col_2", "Col_1", "Col_2", "Col_3"))
col_names <- names(test)
# find all columns that have duplicated columns
dup_names <- col_names[duplicated(col_names)]
# make the column names unique so it will work with tidyr
renamed_test <- test %>%
set_names(str_c(col_names, "-", 1:ncol(test)))
unique_data <- test[!(duplicated(col_names) | duplicated(col_names, fromLast = TRUE))]
# for each duplicated column name, merge all columns that have the same name
dup_names %>% map(function(col_name) {
renamed_test %>%
select(starts_with(col_name)) %>%
gather %>% # bind rows
select(-1) %>% # merged value is the last column
set_names(c(col_name)) # rename the column name back to its original name
}) %>% bind_cols
result <- bind_rows(tmp_result, unique_data)
This is tricky when you try to bind the columns because the merged data might have different row number. You can compare the length every time when merging and fill the shorter list by appending 0s.
Upvotes: 1
Reputation: 13309
Try this. The logic isn't clear: EDIT:: I think the best one can do is simply melt the data like this
library(tidyverse)
df1<-df %>%
gather("ID","Value") %>%
group_by(ID) %>%
arrange(Value)
df1$ID<-str_replace_all(df1$ID,"Col_1.\\d","Col_1")
You could proceed like this but I feel leaving the data melted is better.
library(reshape2)
df1 %>%
ungroup() %>%
dcast(Value~ID,fun=mean) %>%
mutate(Col_2=ifelse(Col_1<=4,5,0)) %>%
select(-Value)
Result(melted): The question then is how to deal with the duplicates.
ID Value
<chr> <int>
1 Col_1 1
2 Col_1 2
3 Col_1 3
4 Col_1 4
5 Col_1 5
6 Col_2 5
7 Col_2 5
8 Col_2 5
9 Col_2 5
10 Col_1 6
11 Col_1 7
12 Col_1 8
13 Col_1 9
14 Col_1 10
15 Col_1 11
16 Col_1 12
17 Col_1 13
18 Col_1 14
19 Col_1 15
20 Col_1 16
Original:
library(tidyverse)
df %>%
gather(key,value,-Col_2) %>%
arrange(value) %>%
rename(Col_1=value) %>%
mutate(Col_2=ifelse(Col_1<=4,5,0)) %>%
select(Col_1,everything(),-key)
Result:
Col_1 Col_2
1 1 5
2 2 5
3 3 5
4 4 5
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 0
14 14 0
15 15 0
16 16 0
Upvotes: 0