Adamm
Adamm

Reputation: 2306

Binding rows of two data frames (merging by column names) with duplicated column names

I'd like to bind rows of two df's by column names, if column doesn't exist in 2nd df, first should have NA's. Unfortunately dplyr::bind_rows has a problem with duplicated column names. Could you help me how to solve this problem?

What I did so far?

First some reproducible data:

df1 <- data.frame(replicate(6,sample(0:1,10,rep=TRUE)))
df2 <- data.frame(replicate(3,sample(0:1,10,rep=TRUE)))
colnames(df1) <- c('A','A','A','B','C','E')
colnames(df2) <- c('A','B','C')

df1
   A A A B C E
1  0 1 0 1 1 1
2  1 1 1 0 0 0
3  0 0 1 1 0 1
4  0 0 1 0 1 1
5  0 1 0 1 1 0
6  1 0 1 1 1 1
7  0 0 1 1 1 0
8  0 1 0 1 1 0
9  0 1 0 0 1 0
10 1 1 0 1 1 0

df2
   A B C
1  1 1 0
2  0 1 1
3  1 1 0
4  1 0 0
5  0 0 1
6  0 0 1
7  0 1 1
8  0 0 0
9  0 0 1
10 0 1 1

What I'd like to get?

   A A A B C E
1  0 1 0 1 1 1
2  1 1 1 0 0 0
3  0 0 1 1 0 1
4  0 0 1 0 1 1
5  0 1 0 1 1 0
6  1 0 1 1 1 1
7  0 0 1 1 1 0
8  0 1 0 1 1 0
9  0 1 0 0 1 0
10 1 1 0 1 1 0
11 1 1 1 1 0 NA
12 0 0 0 1 1 NA
13 1 1 1 1 0 NA
14 1 1 1 0 0 NA
15 0 0 0 0 1 NA
16 0 0 0 0 1 NA
17 0 0 0 1 1 NA
18 0 0 0 0 0 NA
19 0 0 0 0 1 NA
20 0 0 0 1 1 NA

I decided to use bind_rows from dplyr, but:

result <- bind_rows(mutate_all(df1, as.character), mutate_all(df2, as.character))
Error: Columns `A`, `A` must have unique names
Call `rlang::last_error()` to see a backtrace

Thanks for help in advance!

Upvotes: 4

Views: 1090

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388907

Another option could be convert the dataframe to list objects and then cbind them using cbind.fill from rowr package with fill = NA

library(rowr)

new_df <- do.call(cbind.fill, c(mapply(c, as.list(df1),
         as.list(df2)[match(names(df1), names(df2))]), fill = NA))

but this loses column names which you can give it back from df1.

names(new_df) <-  names(df1)
new_df

#   A A A B C  E
#1  0 1 0 0 1  0
#2  0 1 0 0 0  0
#3  0 0 1 1 1  1
#4  0 0 1 0 1  0
#5  1 0 0 0 0  0
#6  0 0 1 0 1  1
#7  0 1 0 0 1  0
#8  0 1 0 0 0  0
#9  0 1 1 0 1  1
#10 1 1 1 0 1  1
#11 0 0 0 1 0 NA
#12 1 1 1 0 0 NA
#13 1 1 1 1 0 NA
#14 0 0 0 1 0 NA
#15 0 0 0 1 1 NA
#16 1 1 1 1 0 NA
#17 0 0 0 1 1 NA
#18 1 1 1 0 0 NA
#19 1 1 1 0 1 NA
#20 1 1 1 0 1 NA

Upvotes: 2

markus
markus

Reputation: 26343

An option using rbindlist from data.table

library(data.table)
cols <- names(df1)[names(df1) %in% names(df2)]
out <- setDF(rbindlist(list(df1, setNames(df2[cols], cols)), fill = TRUE))
out
#   A A A B C  E
#1  0 1 0 1 1  0
#2  1 1 1 0 1  0
#3  0 1 1 0 1  0
#4  0 0 1 0 1  0
#5  1 0 0 1 0  1
#6  1 1 1 1 0  1
#7  0 0 0 1 0  0
#8  0 0 0 1 0  0
#9  1 1 0 1 0  0
#10 0 1 0 1 1  0
#11 0 0 0 1 1 NA
#12 1 1 1 1 0 NA
#13 0 0 0 0 1 NA
#14 0 0 0 0 1 NA
#15 0 0 0 0 1 NA
#16 1 1 1 1 1 NA
#17 0 0 0 0 0 NA
#18 1 1 1 1 1 NA
#19 0 0 0 1 0 NA
#20 0 0 0 1 1 NA

The part df2[cols] would result in the following dataframe with unique names

#   A A.1 A.2 B C
#1  0   0   0 1 1
#2  1   1   1 1 0
#3  0   0   0 0 1
# ...

That's why we call setNames(df2[cols], cols) to immediately rename the columns. setDF ensures that the result is a data.frame and not a data.table.


Using bind_rows you'd end up with only four columns

head(bind_rows(df1, setNames(df2[cols], cols)), 3)
#  A B C E
#1 1 1 0 0
#2 0 0 0 0
#3 0 0 1 1

Upvotes: 3

Related Questions