Valentin
Valentin

Reputation: 61

R: Merge two data frames based on value in column and return all values of both data frames

Let's say I have the following dfs

df1:
a b c d
1 2 3 4
4 3 3 4
9 7 3 4

df2:
a b c d
1 2 3 4
2 2 3 4
3 2 3 4

Now I want to merge both dfs conditional of column "a" to give me the following df

a b c d
1 2 3 4
4 3 3 4
9 7 3 4
2 2 3 4
3 2 3 4

In my dataset i tried using

merge <- merge(x = df1, y = df2, by = "a", all = TRUE)

However, while df1 has 50,000 entries and df2 has 100,000 entries and there are definately matching values in column a the merged df has over one million entries. I do not understand this. As I understand there should be max. 150,000 entries in the merged df and this is the case when no values in column a are equal between the two dfs.

Upvotes: 2

Views: 1388

Answers (4)

ThomasIsCoding
ThomasIsCoding

Reputation: 102609

here is another base R solution using rbind + %in%

dfout <- rbind(df1,subset(df2,!a %in% df1$a))

such that

> rbind(df1,subset(df2,!a %in% df1$a))
   a b c d
1  1 2 3 4
2  4 3 3 4
3  9 7 3 4
21 2 2 3 4
31 3 2 3 4

Upvotes: 0

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

it is possible so

dplyr::union(df1, df2)

Upvotes: 0

akrun
akrun

Reputation: 887811

With tidyverse, we can do bind_rows and distinct

library(dplyr)
bind_rows(df1, df2) %>%
     distinct

data

df1 <- structure(list(a = c(1, 4, 9), b = c(2, 3, 7), c = c(3, 3, 3), 
    d = c(4, 4, 4)), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3), 
    d = c(4, 4, 4)), class = "data.frame", row.names = c(NA, 
-3L))

Upvotes: 1

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21442

I think what you want to do is not mergebut rather rbind the two dataframes and remove the duplicated rows:

DATA:

df1 <- data.frame(a = c(1,4,9),
                  b = c(2,3,7),
                  c = c(3,3,3),
                  d = c(4,4,4))
df2 <- data.frame(a = c(1,2,3),
                  b = c(2,2,2),
                  c = c(3,3,3),
                  d = c(4,4,4))

SOLUTION:

Row-bind df1and df2:

df3 <- rbind(df1, df2)

Remove the duplicate rows:

df3 <- df3[!duplicated(df3), ]

RESULT:

df3
  a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
5 2 2 3 4
6 3 2 3 4

Upvotes: 3

Related Questions