Reputation: 829
This appears to be a simple question but is causing me a lot of headaches (it's not homework, but a sticking point in real research).
I have a single list with 2266 levels. The list looks somewhat like this:
[1] ~/folder1/folder1/a.bin
[2] ~/folder1/folder1/b.bin
[3] ~/folder1/folder1/c.bin
[4] ~/folder1/folder2/a.bin
[5] ~/folder1/folder2/b.bin
[6] ~/folder1/folder2/c.bin
To explain: the list holds the filenames of binary files that I am reading in using the `readBin` function. I want to compare every row with every other row, so what I want are two columns which contain all unique combinations in them, derived from my single column.
`choose(2266, 2)` tells me that there are 2566245 combinations of our single column into two.
`expand.grid()` appears to get me halfway there. But it returns twice as many combinations as I require: I get two columns of 5132490 rows each. This means that there are duplications: 1 + 2 and 2 + 1 are the same thing for my purpose.
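As a sanity check on these counts, here is a small sketch of the arithmetic (the variable names are only illustrative). It assumes the 5132490 figure counts ordered pairs with the self-pairs already excluded, which is what the numbers suggest.

```r
n <- 2266

unordered <- choose(n, 2)   # pairs where order does not matter: 2566245
ordered   <- n * (n - 1)    # ordered pairs without self-pairs:  5132490
full_grid <- n^2            # every row of a full n-by-n grid:   5134756

# Each unordered pair shows up exactly twice among the ordered pairs
ordered / unordered         # 2
```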
`expand.grid.df` with `unique=TRUE` also doesn't seem to help.
My last idea was md5 hashing each of the 5 million rows and trying to detect duplicates that way.
I am looking for some way of making two lists which cover the 2566245 combinations of my list. Alternatively some way of removing all the duplicates. I guess I am not absolutely wedded to using R and have investigated awk or sed to do the same thing. No success yet though.
Upvotes: 2
Views: 229
Reputation: 121608
I think you are looking for `combn`. To get output that looks like that of `expand.grid`, using @Arun's data:
v <- c("~/folder1/folder1/a.bin",
"~/folder1/folder1/b.bin",
"~/folder1/folder1/c.bin",
"~/folder1/folder2/a.bin",
"~/folder1/folder2/b.bin",
"~/folder1/folder2/c.bin")
do.call(rbind, combn(v, 2, simplify = FALSE))
[,1] [,2]
[1,] "~/folder1/folder1/a.bin" "~/folder1/folder1/b.bin"
[2,] "~/folder1/folder1/a.bin" "~/folder1/folder1/c.bin"
[3,] "~/folder1/folder1/a.bin" "~/folder1/folder2/a.bin"
[4,] "~/folder1/folder1/a.bin" "~/folder1/folder2/b.bin"
[5,] "~/folder1/folder1/a.bin" "~/folder1/folder2/c.bin"
[6,] "~/folder1/folder1/b.bin" "~/folder1/folder1/c.bin"
[7,] "~/folder1/folder1/b.bin" "~/folder1/folder2/a.bin"
[8,] "~/folder1/folder1/b.bin" "~/folder1/folder2/b.bin"
[9,] "~/folder1/folder1/b.bin" "~/folder1/folder2/c.bin"
[10,] "~/folder1/folder1/c.bin" "~/folder1/folder2/a.bin"
[11,] "~/folder1/folder1/c.bin" "~/folder1/folder2/b.bin"
[12,] "~/folder1/folder1/c.bin" "~/folder1/folder2/c.bin"
[13,] "~/folder1/folder2/a.bin" "~/folder1/folder2/b.bin"
[14,] "~/folder1/folder2/a.bin" "~/folder1/folder2/c.bin"
[15,] "~/folder1/folder2/b.bin" "~/folder1/folder2/c.bin"
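The same idea should carry over to a longer file list. The following is only a sketch: the `files` vector is a made-up stand-in for the real filenames, and `t(combn(...))` is used here as an alternative to `do.call(rbind, ...)` that produces the same one-pair-per-row matrix.

```r
# Hypothetical stand-in for the real file list (these names are invented)
files <- sprintf("~/folder1/file%03d.bin", 1:100)

# combn() returns a 2 x choose(n, 2) matrix; transpose it so that
# each row holds one unordered pair of filenames
pairs <- t(combn(files, 2))

# The number of rows should match choose(n, 2)
n_pairs <- nrow(pairs)
expected <- choose(length(files), 2)
stopifnot(n_pairs == expected)

head(pairs, 3)
```

With 2266 files the result would have 2566245 rows, so building it may take a little while, but no de-duplication step is needed afterwards.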
EDIT
I think that the path format overcomplicates the problem. If we use, for example, letters in place of file names, we get:
do.call(rbind, combn(letters[1:4], 2, simplify = FALSE))
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
[3,] "a" "d"
[4,] "b" "c"
[5,] "b" "d"
[6,] "c" "d"
So, as you can see, there are no duplicates.
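If you would rather stay with `expand.grid`, one possible way to drop the mirrored pairs is to build the grid over positions instead of the strings and keep only the rows where the first index is smaller than the second. This is a sketch, not the only approach; the names `idx` and `pairs` are just illustrative.

```r
v <- letters[1:4]

# Grid over positions: contains (i, i) as well as both (i, j) and (j, i)
idx <- expand.grid(i = seq_along(v), j = seq_along(v))

# Keeping i < j drops self-pairs and keeps each unordered pair once
idx <- idx[idx$i < idx$j, ]

pairs <- data.frame(first  = v[idx$i],
                    second = v[idx$j],
                    stringsAsFactors = FALSE)

nrow(pairs)  # 6, the same as choose(4, 2)
```

The rows come out in a different order than with `combn`, but the set of pairs is the same. Note that the full grid is built before filtering, so for a long list `combn` is likely the lighter option.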
Upvotes: 2