Reputation: 3441
How can I subset the dataframe below to show only the rows were columns AAA:CCC
all have the same value and retain the IndID
field?
Dat <- data.frame(IndID = LETTERS[seq(1,10)],
AAA = c(1,5,3,2,3,1,5,4,6,2),
BBB = c(1,8,5,2,5,4,8,4,4,5),
CCC = c(1,5,3,2,3,5,5,4,6,5))
> Dat
IndID AAA BBB CCC
1 A 1 1 1
2 B 5 8 5
3 C 3 5 3
4 D 2 2 2
5 E 3 5 3
6 F 1 4 5
7 G 5 8 5
8 H 4 4 4
9 I 6 4 6
10 J 2 5 5
I would like to return the following result.
Result <- data.frame(IndID = c("A", "D", "H"),
AAA = c(1,2,4),
BBB = c(1,2,4),
CCC = c(1,2,4))
> Result
IndID AAA BBB CCC
1 A 1 1 1
2 D 2 2 2
3 H 4 4 4
I have found a number of related posts including Find duplicated rows (based on 2 columns) in Data Frame in R and Find duplicated elements with dplyr, among others, but have not been able to reproduce the desired result with three columns. For example, while close the code below shows all distinct
rows, but with the undesired result of including rows were only two of the values are equal.
Dat %>% distinct(AAA, BBB, CCC)
I suspect the solution involves filter
but an not sure how to obtain the desired result from the example mentioned above. A dplyr
solution is preferred.
Addition
I also wonder if this could be applied to multiple factors. For example with the data below, which contains an IndID
and three dates stored as factors, can a similar result be produced by either allowing factors or changing the factors to numeric values?
Dat <- structure(list(GenIndID = structure(c(1L, 2L, 6L, 7L, 3L, 4L,
8L, 5L), .Label = c("BHS_601", "BHS_603", "BHS_604", "BHS_605",
"BHS_631", "BHS_635", "BHS_636", "BHS_637"), class = "factor"),
CptrDate = structure(c(1L, 2L, 3L, 3L, 2L, 2L, 3L, 4L), .Label = c("2016-02-01",
"2016-02-02", "2016-02-04", "2016-12-11"), class = "factor"),
DtLastAlive = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 1L, 3L), .Label = c("2016-02-04",
"2017-07-13", "2017-08-27"), class = "factor"), DtFnlFate = structure(c(2L,
2L, 1L, 1L, 2L, 2L, 1L, 3L), .Label = c("2016-02-04", "2017-07-13",
"2017-08-27"), class = "factor")), .Names = c("GenIndID",
"CptrDate", "DtLastAlive", "DtFnlFate"), row.names = c(82L, 83L,
224L, 225L, 84L, 85L, 226L, 360L), class = "data.frame")
> Dat
GenIndID CptrDate DtLastAlive DtFnlFate
82 BHS_601 2016-02-01 2017-07-13 2017-07-13
83 BHS_603 2016-02-02 2017-07-13 2017-07-13
224 BHS_635 2016-02-04 2016-02-04 2016-02-04
225 BHS_636 2016-02-04 2016-02-04 2016-02-04
84 BHS_604 2016-02-02 2017-07-13 2017-07-13
85 BHS_605 2016-02-02 2017-07-13 2017-07-13
226 BHS_637 2016-02-04 2016-02-04 2016-02-04
360 BHS_631 2016-12-11 2017-08-27 2017-08-27
with the desired result being
> Dat[c(3, 4, 7),]
GenIndID CptrDate DtLastAlive DtFnlFate
224 BHS_635 2016-02-04 2016-02-04 2016-02-04
225 BHS_636 2016-02-04 2016-02-04 2016-02-04
226 BHS_637 2016-02-04 2016-02-04 2016-02-04
Upvotes: 2
Views: 2043
Reputation: 49448
You can have a vectorized solution:
Dat[do.call(function(...) pmax(...) - pmin(...), Dat[, -1]) == 0,]
# IndID AAA BBB CCC
#1 A 1 1 1
#4 D 2 2 2
#8 H 4 4 4
Upvotes: 1
Reputation: 18681
Here's another solution with apply
and all
:
Dat[apply(Dat[,-1], 1, function(x) all(x==x[1])),]
or with filter_at
from dplyr
:
library(dplyr)
Dat %>%
filter_at(vars(AAA:CCC), all_vars(. == .data$AAA))
Result:
IndID AAA BBB CCC
1 A 1 1 1
4 D 2 2 2
8 H 4 4 4
Edit:
In response to OP's additional example, the apply
example would work regardless of variable type. So the following would work for the new example:
Dat[apply(Dat[,-1], 1, function(x) all(x==x[1])),]
Result:
GenIndID CptrDate DtLastAlive DtFnlFate
224 BHS_635 2016-02-04 2016-02-04 2016-02-04
225 BHS_636 2016-02-04 2016-02-04 2016-02-04
226 BHS_637 2016-02-04 2016-02-04 2016-02-04
For filter_at
, however, if the columns to compare are factor
's, one would need to first convert to character
:
Dat %>%
filter_at(vars(-1), all_vars(as.character(.) == .data$CptrDate))
Notice that you only need to convert .
to character, but not .data$CptrDate
because factors can be compered with characters, but not with another factor with different levels.
An alternative would be:
Dat %>%
mutate_at(vars(-1), as.character) %>%
filter_at(vars(-1), all_vars(. == .data$CptrDate))
Result:
GenIndID CptrDate DtLastAlive DtFnlFate
1 BHS_635 2016-02-04 2016-02-04 2016-02-04
2 BHS_636 2016-02-04 2016-02-04 2016-02-04
3 BHS_637 2016-02-04 2016-02-04 2016-02-04
Upvotes: 6
Reputation: 2021
If all you need is a subset of 3 columns you can use the DF[,] operator.
# DF[where rows have a value, select columns]
# [where rows where AAA==BBB==CCC, select all columns with ""]
temp <- Dat[Dat$AAA == Dat$BBB & Dat$BBB == Dat$CCC,]
Upvotes: 2
Reputation: 7164
You could use a combination of range()
and diff()
.
Dat[apply(Dat[ ,-1], 1, function(x) diff(range(x)))==0, ]
# IndID AAA BBB CCC
# 1 A 1 1 1
# 4 D 2 2 2
# 8 H 4 4 4
range()
gives you the minimum and the maximum of a vector. diff()
gives you the difference between values in a vector. If the difference between min-value
and max-value
is zero, you know all values are equal. If we apply this row-wise using apply
, we get a TRUE
/FALSE
vector that we can use to index Dat
.
Small examples to illustrate the logic:
test <- c(1, 5, 3)
minmax <- range(test) # gives c(1,5)
diff(minmax) # gives 4
diff(range(c(1, 1, 1))) # gives 0
If we check for every row, whether or not diff(range(your_row))
equals zero, we can use the output of that to index Dat
, like we did above.
Upvotes: 1