Jill Hollenbach

Reputation: 91

Paste together each pair of columns in a data frame in R?

I have a data frame of amino acid sites, and want to create a new data frame of each pairwise combination of these sites.

The original data will look something like this:

df<-cbind(letters[1:5], letters[6:10], letters[11:15])
df
     [,1] [,2] [,3]
[1,] "a"  "f"  "k"
[2,] "b"  "g"  "l"
[3,] "c"  "h"  "m"
[4,] "d"  "i"  "n"
[5,] "e"  "j"  "o"

And what I would like is this:

newdf <- cbind(paste(df[,1], df[,2], sep=""), paste(df[,1], df[,3], sep=""), paste(df[,2], df[,3], sep=""))
newdf
     [,1] [,2] [,3]
[1,] "af" "ak" "fk"
[2,] "bg" "bl" "gl"
[3,] "ch" "cm" "hm"
[4,] "di" "dn" "in"
[5,] "ej" "eo" "jo"

The actual data may have hundreds of rows and/or columns, so obviously I need a less manual way of doing this. Any help is much appreciated; I am but a humble biologist and my skill set in this area is rather limited.

Upvotes: 9

Views: 2179

Answers (4)

Tyler Rinker

Reputation: 109874

Josh and Joshua's answers are better but I thought I'd give my approach:

This requires qdap version 1.1.0 and uses its paste2 function:

library(qdap)

ind <- unique(t(apply(expand.grid(1:3, 1:3), 1, sort)))
ind <- ind[ind[, 1] != ind[, 2], ]
sapply(1:nrow(ind), function(i) paste2(df[, unlist(ind[i, ])], sep=""))

Though to steal from their answers this would be much more readable:

ind <- t(combn(seq_len(ncol(df)), 2))
sapply(1:nrow(ind), function(i) paste2(df[, unlist(ind[i, ])], sep=""))

Upvotes: 2

Stephan Kolassa

Reputation: 8267

Remember that you will get a lot of columns in your new data.frame, given that you say you have hundreds of columns in the original data.frame: if the original data contain n columns, then the new one will contain n(n-1)/2 columns - this scales quadratically.
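To make that growth concrete, here is a quick check using base R's choose(), which counts unordered pairs. The column counts below are arbitrary illustrations, not figures from the question.

```r
# Number of pairwise columns produced from n original columns: n * (n - 1) / 2
n <- c(3, 10, 100, 500)   # illustrative column counts (assumed, not from the question)
pairs <- choose(n, 2)
data.frame(n = n, pairs = pairs)
```

With 500 original columns the result would already have 124750 columns, so it may be worth checking memory before running this on the full data.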

Upvotes: -1

Joshua Ulrich

Reputation: 176648

You can use the FUN argument to combn to paste together the columns from each combination:

combn(ncol(df),2,FUN=function(i) apply(df[,i],1,paste0,collapse=""))
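A possible variation on the same idea, offered only as a sketch: since paste0() is vectorised, the inner apply() can be dropped, and the new columns can be labelled with the pair of sites they came from. The column names s1, s2, s3 are hypothetical and were added just for this illustration.

```r
df <- cbind(letters[1:5], letters[6:10], letters[11:15])
# Hypothetical site names, added only so the result columns can be labelled
colnames(df) <- c("s1", "s2", "s3")

# paste0() is vectorised, so the two columns of each pair can be pasted directly
res <- combn(ncol(df), 2, FUN = function(i) paste0(df[, i[1]], df[, i[2]]))

# Label each new column with the pair of sites it was built from
colnames(res) <- combn(colnames(df), 2, FUN = paste, collapse = "_")
res
```

Because combn() enumerates the pairs in the same order in both calls, the labels line up with the pasted columns.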

Upvotes: 9

Josh O'Brien

Reputation: 162321

A combination of combn() and apply() will get you all of the unordered pairwise combos:

df <- cbind(letters[1:5], letters[6:10], letters[11:15])

apply(X = combn(seq_len(ncol(df)), 2),
      MARGIN = 2,
      FUN = function(jj) {
          apply(df[, jj], 1, paste, collapse="")
      }
)
#      [,1] [,2] [,3]
# [1,] "af" "ak" "fk"
# [2,] "bg" "bl" "gl"
# [3,] "ch" "cm" "hm"
# [4,] "di" "dn" "in"
# [5,] "ej" "eo" "jo"

(If what's going on in the above isn't immediately clear, you might want to have a quick look at the object returned by combn(seq_len(ncol(df)), 2). Its columns enumerate all unordered pairwise combos of the integers between 1 and n, where n is the number of columns in your data frame.)
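For the three-column example, that intermediate object looks like this (n is set by hand here purely for illustration):

```r
n <- 3  # number of columns in the example data
idx <- combn(seq_len(n), 2)
idx
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3
```

Each column of idx is one pair of column indices, which is why the outer apply() runs over margin 2.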

Upvotes: 12
