goegges

Reputation: 107

Filter out rows between two values

I have a problem regarding filtering out some rows.

Sample dataset:

df <- data.frame(
  id = c("1", "1", "1", "2", "2", "2", "3", "3"),
  description = c("Start", "Something", "Final", "Start", "Some Other Thing", "Final", "Start", "Final"),
  timestamp = c("2017-07-26 23:41:16", "2017-07-27 20:23:16", "2017-07-29 07:06:53",
                "2017-07-24 04:53:02", "2017-07-25 10:27:02", "2017-07-26 16:51:43",
                "2017-07-13 08:33:05",
                NA)  # the eighth timestamp is missing from the original post
)

Now I want to remove every id group that has no other rows between description = "Start" and description = "Final". This should be checked separately for each id. In this example, that would be the group with id 3.

Any help would be appreciated. Thanks in advance!

Upvotes: 0

Views: 266

Answers (3)

Yuriy Saraykin

Reputation: 8880

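Another solution: count the rows in each id group and drop the groups that contain exactly two rows, i.e. only Start and Final.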

library(tidyverse)
df %>% 
  group_by(id) %>% 
  mutate(n = n()) %>% 
  filter(n != 2)
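
A row count assumes that every id has exactly one Start and one Final row. If that might not hold, a minimal sketch that checks the descriptions directly could look like this. The small df is a stand-in for the question's data (timestamps left out because they are not needed here), and kept is just a name chosen for illustration.

```r
library(dplyr)

# Stand-in for the question's data; timestamps are omitted on purpose.
df <- data.frame(
  id = c("1", "1", "1", "2", "2", "2", "3", "3"),
  description = c("Start", "Something", "Final",
                  "Start", "Some Other Thing", "Final",
                  "Start", "Final"),
  stringsAsFactors = FALSE
)

# Keep a group only if at least one of its rows is neither Start nor Final.
kept <- df %>%
  group_by(id) %>%
  filter(any(!description %in% c("Start", "Final"))) %>%
  ungroup()
```

This keeps all rows of ids 1 and 2 and drops id 3, without adding a helper column.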

Upvotes: 0

Taufi

Reputation: 1577

The following might be one solution to your problem.

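library(dplyr)  # provides %>% and inner_join()

# Count the descriptions per id
Test = df %>% aggregate(description~id, data=., FUN=function(x) c(count=length(x)))
Test$id = as.factor(Test$id)
# Attach the counts to df, then keep ids with more than two descriptions
df = inner_join(df, Test, by = "id")
df = df[df$description.y > 2, ]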

The idea is to attach the per-id description count with an inner_join and then drop all groups that have only two descriptions (Start, Final). The output is:

> df
  id    description.x           timestamp description.y
1  1            Start 2017-07-26 23:41:16             3
2  1        Something 2017-07-27 20:23:16             3
3  1            Final 2017-07-29 07:06:53             3
4  2            Start 2017-07-24 04:53:02             3
5  2 Some Other Thing 2017-07-25 10:27:02             3
6  2            Final 2017-07-26 16:51:43             3
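
The same count-based filter can probably be written in base R without the join, which also avoids the description.x / description.y columns. This is only a sketch: the small df is a stand-in for the question's data, and group_size and result are names chosen here for illustration.

```r
# Stand-in for the question's data; timestamps are omitted on purpose.
df <- data.frame(
  id = c("1", "1", "1", "2", "2", "2", "3", "3"),
  description = c("Start", "Something", "Final",
                  "Start", "Some Other Thing", "Final",
                  "Start", "Final"),
  stringsAsFactors = FALSE
)

# For every row, the number of rows that share its id.
group_size <- ave(seq_along(df$id), df$id, FUN = length)

# Keep only the groups with more than two rows (more than Start and Final).
result <- df[group_size > 2, ]
```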

Is that what you had in mind?

Upvotes: 0

caldwellst

Reputation: 5956

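If we convert timestamp to a datetime, we can sort the data and use cumsum to do what you want (I think). The running count of Start/Final rows is odd from a Start row up to, but not including, the next Final row, so dropping the Start rows leaves only the rows in between.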

library(dplyr)
library(lubridate)

df %>%
  mutate(timestamp = lubridate::as_datetime(timestamp)) %>%
  group_by(id) %>%
  arrange(id, timestamp) %>%
  mutate(tracker = cumsum(description %in% c("Start", "Final"))) %>%
  filter((tracker %% 2 == 1) & description != "Start")
#> # A tibble: 2 x 4
#> # Groups:   id [2]
#>   id    description      timestamp           tracker
#>   <fct> <fct>            <dttm>                <int>
#> 1 1     Something        2017-07-27 20:23:16       1
#> 2 2     Some Other Thing 2017-07-25 10:27:02       1
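
If the goal is to keep the complete groups rather than only the in-between rows, one option is to use the result above as a lookup and semi-join the original data against it. The following is a sketch under assumptions: the df below rebuilds the question's data with NA for the timestamp that is missing from the post, and between and kept are names chosen here for illustration.

```r
library(dplyr)

# Question's data; the eighth timestamp is missing in the post, so NA is used.
df <- data.frame(
  id = c("1", "1", "1", "2", "2", "2", "3", "3"),
  description = c("Start", "Something", "Final",
                  "Start", "Some Other Thing", "Final",
                  "Start", "Final"),
  timestamp = as.POSIXct(
    c("2017-07-26 23:41:16", "2017-07-27 20:23:16", "2017-07-29 07:06:53",
      "2017-07-24 04:53:02", "2017-07-25 10:27:02", "2017-07-26 16:51:43",
      "2017-07-13 08:33:05", NA),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Rows that lie strictly between a Start and the following Final, per id.
between <- df %>%
  arrange(id, timestamp) %>%
  group_by(id) %>%
  mutate(tracker = cumsum(description %in% c("Start", "Final"))) %>%
  filter(tracker %% 2 == 1, description != "Start") %>%
  ungroup()

# Keep every row of df whose id has at least one in-between row.
kept <- semi_join(df, between, by = "id")
```

Note that arrange() sorts NA timestamps last, so the Final row of id 3 still comes after its Start row here.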

Upvotes: 0
