Ex-StrConsultant

Reputation: 17

R - Expand Grid Without Duplicates

I need a function similar to expand.grid but without the combinations of duplicate elements.

Here is a simplified version of my problem.

X1 = c("x","y","z")
X2 = c("A","B","C")
X3 = c("y","C","G")

d <- expand.grid(X1,X2,X3)

d
   Var1 Var2 Var3
1     x    A    y
2     y    A    y
3     z    A    y
4     x    B    y
.     .    .    .
.     .    .    .
.     .    .    .
23    y    B    G
24    z    B    G
25    x    C    G
26    y    C    G
27    z    C    G

d has 27 rows, but 6 of them (rows 2, 5, 8, 16, 17 and 18) contain duplicate values, which I do not need.

Is there a way to get only the other 21 rows, which do not contain any duplicates?

Note that in the real case the vectors have more than 3 elements (c("x","y","z","k","m"...), up to 50) and there are more than 3 vectors (X4, X5, X6... up to 11). Because of this, the expanded object gets very large and does not fit in RAM.

Upvotes: 1

Views: 1539

Answers (2)

r2evans

Reputation: 160407

(Sorry, I just realized that your problem is as much a size problem, so removing them post-generation may not be feasible. For that, this may not be the best answer, but I'll keep it around for smaller-and-related questions.)

base R

I hard-code the column count below (the comparisons against 2 and 3), but you can use ncol(d) and/or ncol(d)-1 for programmatic use.

d[lengths(apply(d, 1, unique)) > 2, ]
#    Var1 Var2 Var3
# 1     x    A    y
# 3     z    A    y
# 4     x    B    y
# 6     z    B    y
# 7     x    C    y
# 9     z    C    y
# 10    x    A    C
# 11    y    A    C
# 12    z    A    C
# 13    x    B    C
# 14    y    B    C
# 15    z    B    C
# 19    x    A    G
# 20    y    A    G
# 21    z    A    G
# 22    x    B    G
# 23    y    B    G
# 24    z    B    G
# 25    x    C    G
# 26    y    C    G
# 27    z    C    G

(The row names are not reset; the gaps confirm that fewer than 27 rows remain.)
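
A programmatic variant of the filter above, as a minimal sketch. It assumes d is rebuilt from the question's vectors, and it swaps unique()/lengths() for anyDuplicated(), so no column count needs to be hard-coded. It also sidesteps the case where every row has the same number of unique values, in which apply() simplifies its result to a matrix instead of a list. The names has_dupe, d_unique and d_dupes are only illustrative.

```r
# Rebuild the example grid from the question
X1 <- c("x", "y", "z")
X2 <- c("A", "B", "C")
X3 <- c("y", "C", "G")
d <- expand.grid(X1, X2, X3)

# anyDuplicated() returns 0 when a row has no repeated value,
# otherwise the index of the first repeat
has_dupe <- apply(d, 1, anyDuplicated) > 0

d_unique <- d[!has_dupe, ]  # rows to keep
d_dupes  <- d[has_dupe, ]   # rows that were dropped

nrow(d_unique)
# [1] 21
nrow(d_dupes)
# [1] 6
```

This still builds the full grid first, so it only helps when expand.grid itself fits in memory.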

And to verify, here are the rows with dupes:

d[lengths(apply(d, 1, unique)) < 3, ]
#    Var1 Var2 Var3
# 2     y    A    y
# 5     y    B    y
# 8     y    C    y
# 16    x    C    C
# 17    y    C    C
# 18    z    C    C
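
Because filtering after the fact still needs the full grid in memory, one alternative is to build the rows one vector at a time and drop any partial row that would repeat a value. The following is only a sketch in base R; expand_grid_distinct is a hypothetical helper name, not part of any package. It returns a character matrix with the same rows, in the same order, as the filter above. The result can still be very large when the vectors share few values.

```r
# Hypothetical helper: like expand.grid(), but a row is never extended
# with a value it already contains, so rows with repeats are never built.
expand_grid_distinct <- function(...) {
  vecs <- lapply(list(...), as.character)
  res <- matrix(vecs[[1]], ncol = 1)
  for (v in vecs[-1]) {
    # For each candidate value, keep only the partial rows that do not
    # already contain it, then append the value as a new column
    parts <- lapply(v, function(val) {
      ok <- rowSums(res == val) == 0
      cbind(res[ok, , drop = FALSE], val)
    })
    res <- do.call(rbind, parts)
  }
  colnames(res) <- paste0("Var", seq_len(ncol(res)))
  res
}

X1 <- c("x", "y", "z")
X2 <- c("A", "B", "C")
X3 <- c("y", "C", "G")

res <- expand_grid_distinct(X1, X2, X3)
nrow(res)
# [1] 21
```

Pruning at each step means the intermediate object never holds more rows than the number of valid partial rows times the length of the next vector, rather than the full product of all lengths.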

Upvotes: 2

Joseph Wood

Reputation: 7597

In RcppAlgos*, there is a function called comboGrid that does the trick:

library(RcppAlgos) ## as of v2.4.3
comboGrid(X1, X2, X3, repetition = FALSE)
#      Var1 Var2 Var3
#  [1,] "x"  "A"  "C" 
#  [2,] "x"  "A"  "G" 
#  [3,] "x"  "A"  "y" 
#  [4,] "x"  "B"  "C" 
#  [5,] "x"  "B"  "G" 
#  [6,] "x"  "B"  "y" 
#  [7,] "x"  "C"  "G" 
#  [8,] "x"  "C"  "y" 
#  [9,] "y"  "A"  "C" 
# [10,] "y"  "A"  "G" 
# [11,] "y"  "B"  "C" 
# [12,] "y"  "B"  "G" 
# [13,] "y"  "C"  "G" 
# [14,] "z"  "A"  "C" 
# [15,] "z"  "A"  "G" 
# [16,] "z"  "A"  "y" 
# [17,] "z"  "B"  "C" 
# [18,] "z"  "B"  "G" 
# [19,] "z"  "B"  "y" 
# [20,] "z"  "C"  "G" 
# [21,] "z"  "C"  "y"

Large Test

set.seed(42)
rnd_lst <- lapply(1:11, function(x) {
    sort(sample(LETTERS, sample(26, 1)))
})

## Number of results that expand.grid would return if your machine
## had enough memory... over 300 billion!!!
prettyNum(prod(lengths(rnd_lst)), big.mark = ",")
# [1] "365,634,846,720"

exp_grd_test <- expand.grid(rnd_lst)
# Error: vector memory exhausted (limit reached?)

system.time(cmb_grd_test <- comboGrid(rnd_lst, repetition=FALSE))
#  user  system elapsed 
# 9.866   0.330  10.196 

dim(cmb_grd_test)
# [1] 3036012      11

head(cmb_grd_test)
#     Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
# [1,] "A"  "E"  "C"  "B"  "D"  "G"  "F"  "H"  "J"  "I"   "K"  
# [2,] "A"  "E"  "C"  "B"  "D"  "G"  "F"  "H"  "J"  "I"   "L"  
# [3,] "A"  "E"  "C"  "B"  "D"  "G"  "F"  "H"  "J"  "I"   "M"  
# [4,] "A"  "E"  "C"  "B"  "D"  "G"  "F"  "H"  "J"  "I"   "N"  
# [5,] "A"  "E"  "C"  "B"  "D"  "G"  "F"  "H"  "J"  "I"   "O"  
# [6,] "A"  "E"  "C"  "B"  "D"  "G"  "F"  "H"  "J"  "I"   "P"

* I am the author of RcppAlgos

Upvotes: 6
