Frank Zafka

Reputation: 829

Make two columns from one column covering all combinations

This appears to be a simple question, but it is causing me a lot of headaches (it's not homework, but a sticking point in real research).

I have a single list with 2266 levels. The list looks somewhat like this:

[1] ~/folder1/folder1/a.bin
[2] ~/folder1/folder1/b.bin
[3] ~/folder1/folder1/c.bin
[4] ~/folder1/folder2/a.bin
[5] ~/folder1/folder2/b.bin
[6] ~/folder1/folder2/c.bin

To explain: the list is filenames of binary files that I am reading in using the readBin function. I want to compare every row with every other row, so what I want are two columns which contain all unique combinations in them, derived from my single column.

choose(2266, 2) tells me that there are 2566245 combinations of our single column into two.

expand.grid() appears to get me halfway there. But there are twice as many rows as I require: I get two columns of 5132490 rows each. This means that there are duplications: 1 + 2 and 2 + 1 are the same thing for my purpose.

expand.grid.df with unique=TRUE also doesn't seem to help.

My last idea was md5 hashing each of the 5 million rows and trying to detect duplicates that way.

I am looking for some way of making two lists which cover the 2566245 combinations of my list. Alternatively some way of removing all the duplicates. I guess I am not absolutely wedded to using R and have investigated awk or sed to do the same thing. No success yet though.

Upvotes: 2

Views: 229

Answers (1)

agstudy

Reputation: 121608

I think you are looking for combn, with the result bound into a two-column matrix similar to what expand.grid gives. Using @Arun's data:

v <- c("~/folder1/folder1/a.bin", 
       "~/folder1/folder1/b.bin", 
       "~/folder1/folder1/c.bin", 
       "~/folder1/folder2/a.bin", 
       "~/folder1/folder2/b.bin", 
       "~/folder1/folder2/c.bin")
do.call(rbind, combn(v, 2, simplify = FALSE))

    [,1]                      [,2]                     
 [1,] "~/folder1/folder1/a.bin" "~/folder1/folder1/b.bin"
 [2,] "~/folder1/folder1/a.bin" "~/folder1/folder1/c.bin"
 [3,] "~/folder1/folder1/a.bin" "~/folder1/folder2/a.bin"
 [4,] "~/folder1/folder1/a.bin" "~/folder1/folder2/b.bin"
 [5,] "~/folder1/folder1/a.bin" "~/folder1/folder2/c.bin"
 [6,] "~/folder1/folder1/b.bin" "~/folder1/folder1/c.bin"
 [7,] "~/folder1/folder1/b.bin" "~/folder1/folder2/a.bin"
 [8,] "~/folder1/folder1/b.bin" "~/folder1/folder2/b.bin"
 [9,] "~/folder1/folder1/b.bin" "~/folder1/folder2/c.bin"
[10,] "~/folder1/folder1/c.bin" "~/folder1/folder2/a.bin"
[11,] "~/folder1/folder1/c.bin" "~/folder1/folder2/b.bin"
[12,] "~/folder1/folder1/c.bin" "~/folder1/folder2/c.bin"
[13,] "~/folder1/folder2/a.bin" "~/folder1/folder2/b.bin"
[14,] "~/folder1/folder2/a.bin" "~/folder1/folder2/c.bin"
[15,] "~/folder1/folder2/b.bin" "~/folder1/folder2/c.bin"
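As a sanity check, the number of rows can be compared against choose(), and t(combn(...)) is a shorter way to get one pair per row. This is a minimal sketch that redefines v so it runs on its own; the names pairs and pairs_df (and the column names file1 and file2) are only illustrative.

```r
v <- c("~/folder1/folder1/a.bin",
       "~/folder1/folder1/b.bin",
       "~/folder1/folder1/c.bin",
       "~/folder1/folder2/a.bin",
       "~/folder1/folder2/b.bin",
       "~/folder1/folder2/c.bin")

# combn() returns a matrix with one pair per column;
# transposing it gives one pair per row
pairs <- t(combn(v, 2))

# The row count should match the number of unordered pairs
nrow(pairs) == choose(length(v), 2)

# Optional: a two-column data frame, one row per comparison
pairs_df <- data.frame(file1 = pairs[, 1],
                       file2 = pairs[, 2],
                       stringsAsFactors = FALSE)
```

With the real list of 2266 file names the same call should give the 2566245 rows mentioned in the question.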

EDIT

I think that the path format overcomplicates the problem. If we use, for example, letters in place of file names, we get:

do.call(rbind, combn(letters[1:4], 2, simplify = FALSE))
     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "a"  "d" 
[4,] "b"  "c" 
[5,] "b"  "d" 
[6,] "c"  "d"  

So, as you can see, there are no duplicates.
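If you would rather start from the expand.grid output mentioned in the question, one way to drop the mirrored pairs is to build the grid over positions and keep only the rows where the first position is smaller than the second. This is a sketch under that assumption, not something taken from the question; x, idx and pairs are illustrative names.

```r
x <- letters[1:4]

# All ordered pairs of positions, including (i, i) and both (i, j) and (j, i)
idx <- expand.grid(i = seq_along(x), j = seq_along(x))

# Keep each unordered pair once and drop the self-pairs
idx <- idx[idx$i < idx$j, ]

pairs <- data.frame(first  = x[idx$i],
                    second = x[idx$j],
                    stringsAsFactors = FALSE)

# Should agree with the number of combinations
nrow(pairs) == choose(length(x), 2)
```

The rows come out in a different order than with combn, and the full grid has to be built before it is filtered, so combn is the more direct route.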

Upvotes: 2
