Reputation: 331

Subsetting dataframes based on column values in R

Given a dataframe, for example:

a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
c <- cbind(a,b)

I would like to subset the dataframe by removing rows that contain the same pair of values in a different order (e.g., row 3: 3,4 is the same as row 4: 4,3), keeping only one of each.

Upvotes: 0

Answers (3)

Ari B. Friedman

Reputation: 72731

Assuming d is your matrix, not c:

e <- unique(apply(d,1,function(x) paste(sort(x),collapse="~")))
> t(sapply(strsplit(e,"~"),as.numeric))
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    4
[4,]    2    5
[5,]    1    6

Breaking it down:

First line

apply(d,1,function(x) ... ) takes each row of d and passes it as a vector x to the anonymous function whose body I've called ... here.

The function body is paste(sort(x),collapse="~"), which sorts the vector and then collapses it into a length-one character vector with the elements separated by a ~.

So the apply call overall is going to return a character vector where each element used to be a row of the matrix.

Then unique keeps only the unique elements. Because each row was sorted first, rows such as 3,4 and 4,3 produce the same string ("3~4"), so unique treats them as duplicates and keeps only one.
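
As a rough illustration of the first line, the intermediate values can be inspected as below. This is only a sketch that reuses the question's data, with d as the matrix name assumed in this answer; keys is a hypothetical name introduced here for the intermediate result.

```r
# Rebuild the example matrix from the question (named d, as assumed above)
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
d <- cbind(a, b)

# One string key per row; sorting first makes the key independent of column order
# ('keys' is a hypothetical intermediate name, not part of the original answer)
keys <- apply(d, 1, function(x) paste(sort(x), collapse = "~"))
print(keys)  # rows 3 and 4 both map to "3~4"

# unique() drops the repeated key, leaving five strings
e <- unique(keys)
print(e)
```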

Second line

strsplit(e,"~") splits our character vector back into a separated form. In this case, it's a list where each element is a character vector of the numbers that comprise each row.

sapply(...,as.numeric) applies as.numeric() to each element of the list. So we convert the character vector back to a numeric vector. Since the s in sapply stands for "simplify," it will create a matrix from this.

But it's the wrong direction (2x5 instead of 5x2)! t() transposes the matrix to the original form.
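
The second line can be unpacked the same way. The sketch below assumes the same d and e as above; parts, wide and res are hypothetical names used only to show each stage.

```r
# Same setup as in the answer: d is the matrix, e the unique row keys
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
d <- cbind(a, b)
e <- unique(apply(d, 1, function(x) paste(sort(x), collapse = "~")))

# Hypothetical intermediate names, one per stage of the second line
parts <- strsplit(e, "~")            # list of 5 character vectors, each of length 2
wide  <- sapply(parts, as.numeric)   # simplified into a 2 x 5 numeric matrix
res   <- t(wide)                     # transposed to 5 x 2

print(dim(wide))
print(dim(res))
print(res)
```

Note that each row comes back in sorted order (2 5 rather than the original 5 2), which matches the output shown at the top of this answer.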

Upvotes: 2

wotuzu17

Reputation: 157

In your example, c is not a data.frame but a matrix. Also, c shouldn't be used as a variable name, as others have stated, since it is also the name of the base function c().

In one line, you can do:

a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
cc <- cbind(a,b)
cc[!duplicated(t(apply(cc,1,sort))), ]
     a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 5 2
[5,] 6 1
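
If the data really is a data.frame, as the question title suggests, one possible variant is sketched below. It swaps in pmin()/pmax() to build an order-independent key instead of apply(..., sort), so it only covers the two-column case; df, key and df_unique are hypothetical names.

```r
# The question's data as an actual data.frame ('df' is a hypothetical name)
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
df <- data.frame(a, b)

# Order-independent key: smaller value first, larger value second
# (pmin/pmax are swapped in here; this only works for two columns)
key <- data.frame(lo = pmin(df$a, df$b), hi = pmax(df$a, df$b))

# Keep the first row of each unordered pair, with original column order intact
df_unique <- df[!duplicated(key), ]
print(df_unique)
```

Unlike the matrix one-liner, this keeps the original row names of the data.frame, so the surviving rows are labelled 1, 2, 3, 5 and 6.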

Upvotes: 1

dayne

Reputation: 7774

a <- c(1:3,4:6)
b <- c(2:4,3,2,1)
d <- cbind(a,b)
e <- t(apply(d,1,function(x){x[order(x)]}))
d <- d[!duplicated(e),]

> d
     a b
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 5 2
[5,] 6 1
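
To see why this works, it may help to look at the logical vector that duplicated() returns for the row-sorted matrix. The sketch below reuses the same data; dup and kept are hypothetical names added for illustration.

```r
a <- c(1:3, 4:6)
b <- c(2:4, 3, 2, 1)
d <- cbind(a, b)

# Sort the values within each row, then transpose back to one row per pair
e <- t(apply(d, 1, function(x) x[order(x)]))

# duplicated() on a matrix compares whole rows; TRUE marks a repeat of an earlier row
dup <- duplicated(e)      # 'dup' is a hypothetical name
print(dup)                # only the fourth row (4,3) repeats an earlier one (3,4)

# Index the original matrix, so the kept rows retain their original column order
kept <- d[!dup, ]         # 'kept' is a hypothetical name
print(kept)
```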

Upvotes: 3
