Reputation: 2584
I have a data frame with many columns like below
Column1 Column2 Column3
Q9Y6Y8 P28074 Q9Y6A4
Q9Y6W5 P28066 Q9Y623
Q9Y6H1 P27695 Q9Y5W9
Q5T1J5 P25786;Q9Y623
Q9Y6A4
Q9Y623;P27695;Q9Y623
Q9Y5W9
Q9Y6Y8
So I want to first put them all together and get their unique like below
Q9Y6Y8
Q9Y6W5
Q9Y6H1
Q5T1J5
Q9Y6A4
Q9Y623
P27695
Q9Y623
Q9Y5W9
Q9Y6Y8
P25786
P28074
P28066
Then I want a combination of all strings two by two like below
Q9Y6Y8 Q9Y6W5
Q9Y6Y8 Q9Y6H1
Q9Y6Y8 Q9Y6A4
Q9Y6Y8 Q5T1J5
Q9Y6Y8 Q9Y6A4
Q9Y6Y8 Q9Y623
Q9Y6Y8 P27695
Q9Y6Y8 Q9Y623
.
.
.
Q9Y6W5 Q9Y6H1
Q9Y6W5 Q9Y6A4
Q9Y6W5 Q5T1J5
.
.
.
until all strings are in paris once
Upvotes: 1
Views: 55
Reputation: 886948
We can do this by unlist
ing the data.frame (as data.frame is a list
) to a vector
, split by ;
, then unlist
the list
output (from strsplit
) and get the unique
elements as a vector
.
Un1 <- unique(unlist(strsplit(unlist(df1), ";")))
From this, we can get all the combinations using expand.grid
expand.grid(Un1, Un1)
Or if we need only limited combinations, combn
can be used.
t(combn(Un1, 2))
# [,1] [,2]
# [1,] "Q9Y6Y8" "Q9Y6W5"
# [2,] "Q9Y6Y8" "Q9Y6H1"
# [3,] "Q9Y6Y8" "Q5T1J5"
# [4,] "Q9Y6Y8" "Q9Y6A4"
# [5,] "Q9Y6Y8" "Q9Y623"
# [6,] "Q9Y6Y8" "P27695"
# [7,] "Q9Y6Y8" "Q9Y5W9"
# [8,] "Q9Y6Y8" "P28074"
# [9,] "Q9Y6Y8" "P28066"
#[10,] "Q9Y6Y8" "P25786"
#[11,] "Q9Y6W5" "Q9Y6H1"
#[12,] "Q9Y6W5" "Q5T1J5"
#[13,] "Q9Y6W5" "Q9Y6A4"
#[14,] "Q9Y6W5" "Q9Y623"
#[15,] "Q9Y6W5" "P27695"
#[16,] "Q9Y6W5" "Q9Y5W9"
#[17,] "Q9Y6W5" "P28074"
#[18,] "Q9Y6W5" "P28066"
#[19,] "Q9Y6W5" "P25786"
#[20,] "Q9Y6H1" "Q5T1J5"
#[21,] "Q9Y6H1" "Q9Y6A4"
#[22,] "Q9Y6H1" "Q9Y623"
#[23,] "Q9Y6H1" "P27695"
#[24,] "Q9Y6H1" "Q9Y5W9"
#[25,] "Q9Y6H1" "P28074"
#[26,] "Q9Y6H1" "P28066"
#[27,] "Q9Y6H1" "P25786"
#[28,] "Q5T1J5" "Q9Y6A4"
#[29,] "Q5T1J5" "Q9Y623"
#[30,] "Q5T1J5" "P27695"
#[31,] "Q5T1J5" "Q9Y5W9"
#[32,] "Q5T1J5" "P28074"
#[33,] "Q5T1J5" "P28066"
#[34,] "Q5T1J5" "P25786"
#[35,] "Q9Y6A4" "Q9Y623"
#[36,] "Q9Y6A4" "P27695"
#[37,] "Q9Y6A4" "Q9Y5W9"
#[38,] "Q9Y6A4" "P28074"
#[39,] "Q9Y6A4" "P28066"
#[40,] "Q9Y6A4" "P25786"
#[41,] "Q9Y623" "P27695"
#[42,] "Q9Y623" "Q9Y5W9"
#[43,] "Q9Y623" "P28074"
#[44,] "Q9Y623" "P28066"
#[45,] "Q9Y623" "P25786"
#[46,] "P27695" "Q9Y5W9"
#[47,] "P27695" "P28074"
#[48,] "P27695" "P28066"
#[49,] "P27695" "P25786"
#[50,] "Q9Y5W9" "P28074"
#[51,] "Q9Y5W9" "P28066"
#[52,] "Q9Y5W9" "P25786"
#[53,] "P28074" "P28066"
#[54,] "P28074" "P25786"
#[55,] "P28066" "P25786"
NOTE: Here I assumed the columns are all character
class.
Upvotes: 3