Reputation: 331
Given a data frame, for example:
a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
c <- cbind(a,b)
I would like to subset the data frame by removing rows that contain the same pair of values in a different order (e.g., row 3: 3,4 is the same as row 4: 4,3), keeping only one of them.
Upvotes: 0
Views: 147
Reputation: 72731
Assuming d is your matrix, not c:
e <- unique(apply(d,1,function(x) paste(sort(x),collapse="~")))
> t(sapply(strsplit(e,"~"),as.numeric))
[,1] [,2]
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 2 5
[5,] 1 6
Breaking it down:
First line
apply(d,1,function(x) ... ) takes each row of d and passes it as a vector x to the anonymous function, whose body I've written as ... here.
The function body is paste(sort(x),collapse="~"), which sorts the vector and then turns it into a length-one character vector with the elements separated by a ~.
So the apply call overall returns a character vector where each element used to be a row of the matrix.
Then unique keeps only the unique elements. Because each row was sorted first, rows holding the same pair in a different order produce the same string, so this does what we want.
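To make the first line concrete, here is a small sketch that runs it step by step on the question's data. It assumes the matrix is named d, as above; the name keys is only illustrative and does not appear in the original answer.

```r
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
d <- cbind(a, b)

# One order-independent key per row: sort the pair, then join with "~"
keys <- apply(d, 1, function(x) paste(sort(x), collapse = "~"))
keys
# "1~2" "2~3" "3~4" "3~4" "2~5" "1~6"

# Rows 3 and 4 share the key "3~4", so unique() keeps only one of them
e <- unique(keys)
length(e)
# 5
```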
Second line
strsplit(e,"~") splits our character vector back into a separated form. In this case, it's a list where each element is a character vector of the numbers that make up one row.
sapply(...,as.numeric) applies as.numeric() to each element of the list, so each character vector is converted back to a numeric vector. Since the s in sapply stands for "simplify," it will create a matrix from the results.
But it's the wrong orientation (2x5 instead of 5x2)! t() transposes the matrix back to the original form.
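The second line can be unpacked the same way. The sketch below starts from the value of e produced by the first line; the intermediate names parts, m and res are illustrative only.

```r
e <- c("1~2", "2~3", "3~4", "2~5", "1~6")  # result of the first line

parts <- strsplit(e, "~")        # list of 5 character vectors, e.g. c("1", "2")
m <- sapply(parts, as.numeric)   # simplified column by column: a 2 x 5 matrix
dim(m)
# 2 5

res <- t(m)                      # transpose back to one pair per row
dim(res)
# 5 2
```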
Upvotes: 2
Reputation: 157
In your example, c is not a data.frame but a matrix. Also, c shouldn't be used as a variable name, as others have stated.
In one line, you can do:
a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
cc <- cbind(a,b)
cc[!duplicated(t(apply(cc,1,sort))), ]
a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 5 2
[5,] 6 1
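Since the question talks about a data frame while cbind() produces a matrix, here is a hedged sketch of the same idea applied to an actual data.frame. The names df, key and res are illustrative, and it assumes all columns are numeric, because apply() converts the data frame to a matrix first.

```r
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)

cc <- cbind(a, b)
is.matrix(cc)       # TRUE: cbind() of two vectors gives a matrix
is.data.frame(cc)   # FALSE

df <- data.frame(a, b)

# Sort within each row, then flag rows whose sorted pair was already seen
key <- t(apply(df, 1, sort))
res <- df[!duplicated(key), ]
nrow(res)
# 5
```

The original column order within each row is preserved (for example 5,2 stays 5,2); only the duplicate row 4,3 is dropped.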
Upvotes: 1
Reputation: 7774
a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
d <- cbind(a,b)
e <- t(apply(d,1,function(x){x[order(x)]}))
d <- d[!duplicated(e),]
> d
a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 5 2
[5,] 6 1
Upvotes: 3