Rick Schmidt

Reputation: 21

Finding pairs of duplicate columns in R

Thank you for viewing this post. I am new to R.

I want to check whether any column (not one specified in advance) is a duplicate of another, and return a matrix with dimensions num.duplicates x 2, where each row gives the two indices of a pair of duplicated variables. The matrix should be organized so that the first column holds the lower index of each pair, and the rows are sorted in increasing order.

Let's say I have this dataset:

   v1 v2 v3 v4 v5 v6
1  1  1  2  4  2  1
2  2  2  3  5  3  2
3  3  3  4  6  4  3

and I want this result:

      [,1] [,2]
[1,]    1    2
[2,]    1    6
[3,]    2    6
[4,]    3    5

Please help, thank you!

Upvotes: 1

Views: 469

Answers (3)

thelatemail

Reputation: 93938

Something like this I suppose:

out <- data.frame(t(combn(1:ncol(dd),2)))
out[combn(1:ncol(dd),2,FUN=function(x) all(dd[x[1]]==dd[x[2]])),]

#   X1 X2
#1   1  2
#5   1  6
#9   2  6
#11  3  5
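
Since the question asks for a matrix rather than a data frame, the same combn() idea can be sketched so that it returns one directly. This is only a minimal sketch: the sample data is copied from the question, the names pairs, is_dup and dup_pairs are made up for illustration, and identical() is swapped in for ==, which is stricter (it also requires the two columns to have the same type).

```r
# Sample data from the question
dd <- data.frame(
    v1 = 1:3, v2 = 1:3, v3 = 2:4,
    v4 = 4:6, v5 = 2:4, v6 = 1:3
)

# All index pairs (i, j) with i < j, one pair per row
pairs <- t(combn(ncol(dd), 2))

# TRUE for each pair whose two columns hold exactly the same values
is_dup <- apply(pairs, 1, function(p) identical(dd[[p[1]]], dd[[p[2]]]))

# drop = FALSE keeps a two-column matrix even if only one pair matches
dup_pairs <- pairs[is_dup, , drop = FALSE]
dup_pairs
```

Because combn() already lists pairs in increasing order, no extra sorting step should be needed.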

Upvotes: 1

tonytonov

Reputation: 25638

First, generate all possible combinations with expand.grid. Second, remove duplicate pairs and sort them in the desired order. Third, use sapply to find the indices of repeated columns:

kk <- expand.grid(1:ncol(df), 1:ncol(df))
nn <- kk[kk[, 1] > kk[, 2], 2:1]
nn[sapply(1:nrow(nn), 
         function(i) all(df[, nn[i, 1]] == df[, nn[i, 2]])), ]
   Var2 Var1
2     1    2
6     1    6
12    2    6
17    3    5

The approach I propose is R-ish, but I suppose writing a simple double loop is justified for this case, especially if you recently started learning the language.
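
For reference, the double loop mentioned above could look roughly like the sketch below. The function name find_dup_pairs_loop is hypothetical, and the sample data is copied from the question. Wrapping the comparison in isTRUE() is an assumption about how missing values should be handled: a pair whose comparison yields NA is treated as not duplicated.

```r
# Hypothetical helper: returns a two-column matrix of duplicated column pairs
find_dup_pairs_loop <- function(df) {
    n <- ncol(df)
    out <- matrix(integer(0), ncol = 2)   # empty 0 x 2 result to start from
    if (n < 2) return(out)
    for (i in seq_len(n - 1)) {
        for (j in (i + 1):n) {
            # isTRUE() treats an NA comparison result as "not a duplicate"
            if (isTRUE(all(df[[i]] == df[[j]]))) {
                out <- rbind(out, c(i, j))
            }
        }
    }
    out
}

# Sample data from the question
df <- data.frame(
    v1 = 1:3, v2 = 1:3, v3 = 2:4,
    v4 = 4:6, v5 = 2:4, v6 = 1:3
)
res <- find_dup_pairs_loop(df)
res
```

Because i always runs below j and both loops count upwards, the rows come out with the lower index first and in increasing order.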

Upvotes: 0

MrFlick

Reputation: 206546

I feel like I'm missing something simpler, but this seems to work.

Here's the sample data.

dd <- data.frame(
    v1 = 1:3, v2 = 1:3, v3 = 2:4, 
    v4 = 4:6, v5 = 2:4, v6 = 1:3
)

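Now I'll assign each column to a group using ave() to look for duplicates: identical columns end up in the same group, labelled with the lowest column index in that group (FUN=min). Then I'll check how many columns are in each group.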

groups <- ave(1:ncol(dd), as.list(as.data.frame(t(dd))), FUN=min, drop=TRUE)

Now that I have the groups, I'll split the column indexes up by those groups. If a group contains more than one column, I'll grab all pairwise combinations of its indexes. That creates a wide matrix, which I flip to a tall one, as you want, with t().

morethanone <- function(x) length(x)>1
dups <- t(do.call(cbind, 
    lapply(Filter(morethanone, split(1:ncol(dd), groups)), combn, 2)
))

That returns

     [,1] [,2]
[1,]    1    2
[2,]    1    6
[3,]    2    6
[4,]    3    5

as desired
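
If no columns are duplicated, Filter() leaves nothing for cbind() to combine, so it may be worth guarding that case. The sketch below wraps the steps above in a function that returns an empty 0 x 2 matrix instead. The function name dup_pairs_grouped is hypothetical, the guard is an addition and not part of the approach as originally written, and the sample data is repeated so the block runs on its own.

```r
# Hypothetical wrapper around the grouping approach, with a guard for
# the case where no column is duplicated
dup_pairs_grouped <- function(dd) {
    idx <- seq_len(ncol(dd))
    # Identical columns get the same group label (their lowest index)
    groups <- ave(idx, as.list(as.data.frame(t(dd))), FUN = min, drop = TRUE)
    # Keep only the groups that contain more than one column
    multi <- Filter(function(x) length(x) > 1, split(idx, groups))
    if (length(multi) == 0) return(matrix(integer(0), ncol = 2))
    # All pairs within each group, bound side by side, then flipped to tall
    t(do.call(cbind, unname(lapply(multi, combn, 2))))
}

dd <- data.frame(
    v1 = 1:3, v2 = 1:3, v3 = 2:4,
    v4 = 4:6, v5 = 2:4, v6 = 1:3
)
dups <- dup_pairs_grouped(dd)
dups

# A data frame without duplicated columns gives a matrix with zero rows
none <- dup_pairs_grouped(data.frame(a = 1:3, b = 4:6))
```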

Upvotes: 0
