Anna
Anna

Reputation: 187

Select rows from grouped dataframe based on duplicate values

I have a dataframe with 3 columns. The id of each individual, the number of group they belong (gr) and location codes (loc). What I am trying to do is identify which individuals visit 2 locations with the following sequence: Location 1 -> Location 2 -> Location 1.

Dummy dataset:

id <- c(1,1,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,4,4,4)
gr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,4,4,4,4,4,4,4)

loc <- c(5,5,4,4,5,5,5,3,3,3,3,2,2,2,2,3,3,2,2,2)

df<- data.frame(id,gr, loc) 

I have tried using a diff function, to identify differences between the locations:

dif<- diff(as.numeric(df$loc))

But I can't find any other way to move forward. In addition this approach doesn't account for the groups of each individual (and the ids repeat between groups). I was thinking maybe using a lag function but not sure how or if it helps at all. Any recommendations? Many thanks in advance, I'm still pretty new in R.

Desired output:

id<- c(1,4)
gr<- c(1,4)
out<- data.frame(cbind(id, gr))

Upvotes: 1

Views: 50

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 102700

A possible data.table option

unique(
  setDT(df)[
    ,
    q := rleid(loc), .(id, gr)
  ][
    ,
    .SD[uniqueN(q) == 3 & first(loc) == last(loc)], .(id, gr)
  ][
    ,
    .(id, gr)
  ]
)

gives

   id gr
1:  1  1
2:  4  4

Upvotes: 1

akrun
akrun

Reputation: 887881

May be this works

library(dplyr)
library(data.table)
df %>% 
    group_by(id) %>% 
    filter(n_distinct(rleid(loc)) >2) %>%
    slice_tail(n = 1) %>%
    select(-loc) %>%
    ungroup
# A tibble: 2 x 2
#    id    gr
#  <dbl> <dbl>
#1     1     1
#2     4     4

Upvotes: 1

Related Questions