Kevin Ushey
Kevin Ushey

Reputation: 21285

Merge and paste duplicate columns in R

Suppose I have two data frames with some common variable x:

df1 <- data.frame( 
  x=c(1, 2, 3, 4), 
  y=c("a", "b", "c", "d") 
)

df2 <- data.frame( 
  x=c(1, 1, 2, 2, 3, 4, 5), 
  z=c("A", "B", "C", "D", "E", "F", "G") 
)

We can assume that each entry of the variable we're merging over, x, appears exactly once in df1; however, it may appear an arbitrary number of times in df2.

I want to merge df2 'into' df1, while preserving df1. Is there a fast way of merging these two data frames such that the merged output would be of the form (for example):

df_merged <- data.frame( 
  x=c(1, 2, 3, 4), 
  y=c("a", "b", "c", "d"), 
  z=c("A B", "C D", "E", "F")
)

Essentially, I want df_merged to be a composition of the original df1, in addition to any variables in df2 coerced to match the format of df1. The various incantations of merge will append new rows to the merged output, which I want to avoid.

We can assume that each entry of the variable we're merging over, x, appears exactly once.

Speed is also a priority since I'll be merging fairly large data frames.

Upvotes: 3

Views: 1581

Answers (3)

Kevin Ushey
Kevin Ushey

Reputation: 21285

I'm submitting this question with my own potential answer, but it is fairly slow and I'm curious what other methods might be available.

by <- "x"
df2_processed <- as.data.frame( 
    sapply( names(df2), function(x) {
      tapply( df2[[x]], df2[[by]], function(xx) {
        if( x == by ) {
          return(xx[1])
        } else {
          paste(xx, collapse=" ")
        }
      })
    }), optional=TRUE, stringsAsFactors=FALSE )

merge( df1, df2_processed, all.x=TRUE )

Upvotes: 0

Matthew Plourde
Matthew Plourde

Reputation: 44614

Another option:

df2.z <- with(df2, tapply(z, x, paste, collapse=' '))
transform(df1, z=df2.z[match(x, names(df2.z))])

#   x y   z
# 1 1 a A B
# 2 2 b C D
# 3 3 c   E
# 4 4 d   F

If df1$x is in order, then use df2.z[names(df2.z) %in% x] in the transform statement.

Upvotes: 1

IRTFM
IRTFM

Reputation: 263342

merge( df1, 
       aggregate(df2$z , df2[1], FUN=paste, collapse=" ", sep=""), 
       by.x="x", by.y=1)
  x y   x
1 1 a A B
2 2 b C D
3 3 c   E
4 4 d   F
Warning message:
In merge.data.frame(df1, aggregate(df2$z, df2[1], FUN = paste, collapse = " ",  :
  column name ‘x’ is duplicated in the result
> M1 <- .Last.value
> names(M1)[3] <- "z"
> M1
  x y   z
1 1 a A B
2 2 b C D
3 3 c   E
4 4 d   F

Upvotes: 1

Related Questions