Reputation: 91
I have a data frame of amino acid sites, and want to create a new data frame of each pairwise combination of these sites.
The original data will look something like this:
df<-cbind(letters[1:5], letters[6:10], letters[11:15])
df
[,1] [,2] [,3]
[1,] "a" "f" "k"
[2,] "b" "g" "l"
[3,] "c" "h" "m"
[4,] "d" "i" "n"
[5,] "e" "j" "o"
And what I would like is this:
newdf <- cbind(paste(df[,1], df[,2], sep=""), paste(df[,1], df[,3], sep=""), paste(df[,2], df[,3], sep=""))
newdf
[,1] [,2] [,3]
[1,] "af" "ak" "fk"
[2,] "bg" "bl" "gl"
[3,] "ch" "cm" "hm"
[4,] "di" "dn" "in"
[5,] "ej" "eo" "jo"
The actual data may have hundreds of rows and/or columns, so obviously I need a less manual way of doing this. Any help is much appreciated; I am but a humble biologist and my skill set in this area is rather limited.
Upvotes: 9
Views: 2179
Reputation: 109874
Josh and Joshua's answers are better but I thought I'd give my approach:
This requires qdap version 1.1.0 and uses its paste2 function:
library(qdap)
ind <- unique(t(apply(expand.grid(1:3, 1:3), 1, sort)))
ind <- ind[ind[, 1] != ind[, 2], ]
sapply(1:nrow(ind), function(i) paste2(df[, unlist(ind[i, ])], sep=""))
Though to steal from their answers this would be much more readable:
ind <- t(combn(seq_len(ncol(df)), 2))
sapply(1:nrow(ind), function(i) paste2(df[, unlist(ind[i, ])], sep=""))
Upvotes: 2
Reputation: 8267
Remember that you will get a lot of columns in your new data.frame, given that you say you have hundreds of columns in the original data.frame: if the original data contain n columns, then the new one will contain n(n-1)/2 columns - this scales quadratically.
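To put a number on that growth, base R's choose() gives the pair count directly. The column counts below are illustrative values only, not figures from the question:

```r
# Number of pairwise column combinations for a few illustrative column counts
n_cols <- c(3, 10, 100, 500)
n_pairs <- choose(n_cols, 2)  # same as n_cols * (n_cols - 1) / 2
data.frame(n_cols, n_pairs)
```

With 500 input columns that is already 124750 output columns, so it may be worth checking whether every pair is really needed before building the full result.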
Upvotes: -1
Reputation: 176648
You can use the FUN argument to combn to paste together the columns from each combination:
combn(ncol(df),2,FUN=function(i) apply(df[,i],1,paste0,collapse=""))
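A possible variation, offered as a sketch rather than a replacement: since paste0() is already vectorized over rows, the inner apply() can be dropped by pasting the two selected columns directly. The name pairs is just a label chosen here, and this assumes each combination has exactly two columns:

```r
df <- cbind(letters[1:5], letters[6:10], letters[11:15])

# i holds the two column indices of the current combination;
# paste0() joins the two columns element by element, row for row
pairs <- combn(ncol(df), 2, FUN = function(i) paste0(df[, i[1]], df[, i[2]]))
pairs
```

With hundreds of rows and columns this avoids one R-level function call per row for each column pair, which is likely (though not measured here) to be faster.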
Upvotes: 9
Reputation: 162321
A combination of combn() and apply() will get you all of the unordered pairwise combos:
df <- cbind(letters[1:5], letters[6:10], letters[11:15])
apply(X = combn(seq_len(ncol(df)), 2),
      MARGIN = 2,
      FUN = function(jj) {
          apply(df[, jj], 1, paste, collapse="")
      }
)
# [,1] [,2] [,3]
# [1,] "af" "ak" "fk"
# [2,] "bg" "bl" "gl"
# [3,] "ch" "cm" "hm"
# [4,] "di" "dn" "in"
# [5,] "ej" "eo" "jo"
(If what's going on in the above isn't immediately clear, you might want to have a quick look at the object returned by combn(seq_len(ncol(df)), 2). Its columns enumerate all unordered pairwise combinations of the integers between 1 and n, where n is the number of columns in your data frame.)
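For the three-column example, that index matrix looks like this (idx is just a name chosen for illustration):

```r
# Each column of idx is one unordered pair of column indices
idx <- combn(seq_len(3), 2)
idx
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3
```

The outer apply() walks over these columns, so jj is c(1, 2), then c(1, 3), then c(2, 3).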
Upvotes: 12