B. Davis
B. Davis

Reputation: 3441

Subset dataframe based on duplicate values in multiple rows

How can I subset the dataframe below to show only the rows were columns AAA:CCC all have the same value and retain the IndID field?

Dat <- data.frame(IndID = LETTERS[seq(1,10)],
                  AAA = c(1,5,3,2,3,1,5,4,6,2),
                  BBB = c(1,8,5,2,5,4,8,4,4,5),
                  CCC = c(1,5,3,2,3,5,5,4,6,5))

> Dat
   IndID AAA BBB CCC
1      A   1   1   1
2      B   5   8   5
3      C   3   5   3
4      D   2   2   2
5      E   3   5   3
6      F   1   4   5
7      G   5   8   5
8      H   4   4   4
9      I   6   4   6
10     J   2   5   5

I would like to return the following result.

Result <- data.frame(IndID = c("A", "D", "H"),
                  AAA = c(1,2,4),
                  BBB = c(1,2,4),
                  CCC = c(1,2,4))

> Result
  IndID AAA BBB CCC
1     A   1   1   1
2     D   2   2   2
3     H   4   4   4

I have found a number of related posts including Find duplicated rows (based on 2 columns) in Data Frame in R and Find duplicated elements with dplyr, among others, but have not been able to reproduce the desired result with three columns. For example, while close the code below shows all distinct rows, but with the undesired result of including rows were only two of the values are equal.

Dat %>% distinct(AAA, BBB, CCC) 

I suspect the solution involves filter but an not sure how to obtain the desired result from the example mentioned above. A dplyr solution is preferred.

Addition I also wonder if this could be applied to multiple factors. For example with the data below, which contains an IndID and three dates stored as factors, can a similar result be produced by either allowing factors or changing the factors to numeric values?

Dat <- structure(list(GenIndID = structure(c(1L, 2L, 6L, 7L, 3L, 4L, 
8L, 5L), .Label = c("BHS_601", "BHS_603", "BHS_604", "BHS_605", 
"BHS_631", "BHS_635", "BHS_636", "BHS_637"), class = "factor"), 
    CptrDate = structure(c(1L, 2L, 3L, 3L, 2L, 2L, 3L, 4L), .Label = c("2016-02-01", 
    "2016-02-02", "2016-02-04", "2016-12-11"), class = "factor"), 
    DtLastAlive = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 1L, 3L), .Label = c("2016-02-04", 
    "2017-07-13", "2017-08-27"), class = "factor"), DtFnlFate = structure(c(2L, 
    2L, 1L, 1L, 2L, 2L, 1L, 3L), .Label = c("2016-02-04", "2017-07-13", 
    "2017-08-27"), class = "factor")), .Names = c("GenIndID", 
"CptrDate", "DtLastAlive", "DtFnlFate"), row.names = c(82L, 83L, 
224L, 225L, 84L, 85L, 226L, 360L), class = "data.frame")


> Dat
    GenIndID   CptrDate DtLastAlive  DtFnlFate
82   BHS_601 2016-02-01  2017-07-13 2017-07-13
83   BHS_603 2016-02-02  2017-07-13 2017-07-13
224  BHS_635 2016-02-04  2016-02-04 2016-02-04
225  BHS_636 2016-02-04  2016-02-04 2016-02-04
84   BHS_604 2016-02-02  2017-07-13 2017-07-13
85   BHS_605 2016-02-02  2017-07-13 2017-07-13
226  BHS_637 2016-02-04  2016-02-04 2016-02-04
360  BHS_631 2016-12-11  2017-08-27 2017-08-27

with the desired result being

> Dat[c(3, 4, 7),]
    GenIndID   CptrDate DtLastAlive  DtFnlFate
224  BHS_635 2016-02-04  2016-02-04 2016-02-04
225  BHS_636 2016-02-04  2016-02-04 2016-02-04
226  BHS_637 2016-02-04  2016-02-04 2016-02-04

Upvotes: 2

Views: 2043

Answers (4)

eddi
eddi

Reputation: 49448

You can have a vectorized solution:

Dat[do.call(function(...) pmax(...) - pmin(...), Dat[, -1]) == 0,]
#  IndID AAA BBB CCC
#1     A   1   1   1
#4     D   2   2   2
#8     H   4   4   4

Upvotes: 1

acylam
acylam

Reputation: 18681

Here's another solution with apply and all:

Dat[apply(Dat[,-1], 1, function(x) all(x==x[1])),]

or with filter_at from dplyr:

library(dplyr)
Dat %>% 
  filter_at(vars(AAA:CCC), all_vars(. == .data$AAA))

Result:

  IndID AAA BBB CCC
1     A   1   1   1
4     D   2   2   2
8     H   4   4   4

Edit:

In response to OP's additional example, the apply example would work regardless of variable type. So the following would work for the new example:

Dat[apply(Dat[,-1], 1, function(x) all(x==x[1])),]

Result:

    GenIndID   CptrDate DtLastAlive  DtFnlFate
224  BHS_635 2016-02-04  2016-02-04 2016-02-04
225  BHS_636 2016-02-04  2016-02-04 2016-02-04
226  BHS_637 2016-02-04  2016-02-04 2016-02-04

For filter_at, however, if the columns to compare are factor's, one would need to first convert to character:

Dat %>% 
  filter_at(vars(-1), all_vars(as.character(.) == .data$CptrDate))

Notice that you only need to convert . to character, but not .data$CptrDate because factors can be compered with characters, but not with another factor with different levels.

An alternative would be:

Dat %>%
  mutate_at(vars(-1), as.character) %>%
  filter_at(vars(-1), all_vars(. == .data$CptrDate))

Result:

  GenIndID   CptrDate DtLastAlive  DtFnlFate
1  BHS_635 2016-02-04  2016-02-04 2016-02-04
2  BHS_636 2016-02-04  2016-02-04 2016-02-04
3  BHS_637 2016-02-04  2016-02-04 2016-02-04

Upvotes: 6

Adam Sampson
Adam Sampson

Reputation: 2021

If all you need is a subset of 3 columns you can use the DF[,] operator.

# DF[where rows have a value, select columns]
# [where rows where AAA==BBB==CCC, select all columns with ""]
temp <- Dat[Dat$AAA == Dat$BBB & Dat$BBB == Dat$CCC,] 

Upvotes: 2

KenHBS
KenHBS

Reputation: 7164

You could use a combination of range() and diff().

Dat[apply(Dat[ ,-1], 1, function(x) diff(range(x)))==0, ] 
#   IndID AAA BBB CCC
# 1     A   1   1   1
# 4     D   2   2   2
# 8     H   4   4   4

range() gives you the minimum and the maximum of a vector. diff() gives you the difference between values in a vector. If the difference between min-value and max-value is zero, you know all values are equal. If we apply this row-wise using apply, we get a TRUE/FALSE vector that we can use to index Dat.


Small examples to illustrate the logic:

test <- c(1, 5, 3)
minmax <- range(test)    # gives c(1,5)
diff(minmax)             # gives 4
diff(range(c(1, 1, 1)))  # gives 0 

If we check for every row, whether or not diff(range(your_row)) equals zero, we can use the output of that to index Dat, like we did above.

Upvotes: 1

Related Questions