Reputation: 803
I have a data set and I would like to remove the rows that have duplicate information across several columns (the y1-y6 columns in the example below).
foo <- data.frame(g1 = c("1","0","0","1","1"),
                  v1 = c("7","5","4","4","3"),
                  v2 = c("a","b","x","x","e"),
                  y1 = c("y","c","f","f","w"),
                  y2 = c("y","y","y","f","c"),
                  y3 = c("y","c","c","f","w"),
                  y4 = c("y","y","f","f","c"),
                  y5 = c("y","w","f","f","w"),
                  y6 = c("y","c","f","f","w"))
foo then looks like:
g1 v1 v2 y1 y2 y3 y4 y5 y6
1 1 7 a y y y y y y
2 0 5 b c y c y w c
3 0 4 x f y c f f f
4 1 4 x f f f f f f
5 1 3 e w c w c w w
Now, I want to remove any row that has duplicated data in the y1-y6 columns. So, if done properly, only rows 1 and 4 would be removed, because all of their y values are exactly the same. It's a multiple-column condition.
I believe I am close, but it's just not working correctly.
I have tried: new = foo[!(duplicated(foo[,1:6]))]
thinking that the duplicated command would search for and flag only the rows that match exactly.
I thought about using a conditional statement with &, but I can't figure out how to write that either.
new = foo[foo$y1==foo$y2|foo$y3|foo$y4|foo$y5|foo$y6]
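What I was aiming for is something along these lines, spelled out column by column, though I'm not sure it is even close to right (same and new are just my own placeholder names):
same <- foo$y1 == foo$y2 & foo$y1 == foo$y3 & foo$y1 == foo$y4 &
        foo$y1 == foo$y5 & foo$y1 == foo$y6   # TRUE when all six y values match
new  <- foo[!same, ]                          # keep only the non-duplicated rows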
I also thought about which, but I'm now overwhelmed and lost. I would expect the result to look like:
g1 v1 v2 y1 y2 y3 y4 y5 y6
2 0 5 b c y c y w c
3 0 4 x f y c f f f
5 1 3 e w c w c w w
Upvotes: 6
Views: 4934
Reputation: 81733
> foo[apply(foo[ , paste("y", 1:6, sep = "")], 1,
FUN = function(x) length(unique(x)) > 1 ), ]
g1 v1 v2 y1 y2 y3 y4 y5 y6
2 0 5 b c y c y w c
3 0 4 x f y c f f f
5 1 3 e w c w c w w
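For clarity, the same expression can be split into two steps so the logical index is visible on its own (this is just the code above rearranged, run against the foo from the question):
keep <- apply(foo[ , paste("y", 1:6, sep = "")], 1,
              FUN = function(x) length(unique(x)) > 1)
keep
# [1] FALSE  TRUE  TRUE FALSE  TRUE   (rows 1 and 4 have identical y values)
foo[keep, ]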
Upvotes: 10
Reputation: 263471
> ys <- foo[4:9]    # just the y1-y6 columns
> ys[ !rowSums( apply( ys[2:6], 2, "!=", ys[1] ) ) == 0, ]
  y1 y2 y3 y4 y5 y6
2 c y c y w c
3 f y c f f f
5 w c w c w w
> ys[ !colSums( apply( ys, 1, duplicated ) ) == 5, ]
  y1 y2 y3 y4 y5 y6
2 c y c y w c
3 f y c f f f
5 w c w c w w
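If you want to keep the g1, v1 and v2 columns as well, the same test can be used as a row index into the full foo. A small sketch, assuming the y columns are columns 4 to 9 of foo as defined in the question:
ys   <- foo[4:9]
keep <- rowSums( apply( ys[2:6], 2, "!=", ys[1] ) ) > 0   # TRUE if any y differs from y1
foo[keep, ]   # rows 2, 3 and 5 with all nine columns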
Upvotes: 1