Robert Lepage
Robert Lepage

Reputation: 43

Group_by filtering with a dataframe

I have a tibble that's comprised of time series data. At each time-step, I have a given load (Loadz) and displacement (Dispz) that's associated with the test identification (FName):

# A tibble: 15,926 x 5
# Groups:   FName [92]
   TIMESTAMP             FName  Dispz Loadz Failflag
   <chr>                 <chr>  <dbl> <dbl> <chr>   
 1 2020-09-19 14:56:46   D-H1  0       0    ""      
 2 2020-09-19 14:56:46.5 D-H1  0.019   3.61 ""      
 3 2020-09-19 14:56:47   D-H1  0.04    8.91 ""      
 4 2020-09-19 14:56:47.5 D-H1  0.06   14.5  ""      
 5 2020-09-19 14:56:48   D-H1  0.0790 20.1  ""      
 6 2020-09-19 14:56:48.5 D-H1  0.101  25.7  ""      
 7 2020-09-19 14:56:49   D-H1  0.12   31.2  ""      
 8 2020-09-19 14:56:49.5 D-H1  0.142  36.1  ""      
 9 2020-09-19 14:56:50   D-H1  0.162  40.9  ""      
10 2020-09-19 14:56:50.5 D-H1  0.183  45.9  ""    

I have the code below which filters out all loads and displacements below a certain threshold (for displacements greater than 0.02mm and loads exceeding 10N), and then sweeps (or zeroes) everything so that all tests start at the origin.

dat<-df%>%
  group_by(FName)%>%
  filter(Dispz>0.02)%>%
  filter(Loadz>10)%>%
  mutate_if(is.numeric,funs(.-first(.)))%>%
  slice(1:which.max(Loadz))

However, I have a few tests that have some errors and need to be manually coerced. This is where I'm getting into trouble.

I have the FName identifications of those misbehaving tests and the associated TIMESTAMP value at the location where I need to zero the data, but finding a way to filter out with a look-up dataframe that also looks up the value from this lookup tibble is proving challenging.

    # Identify FName ids that need to be manually modified
    badlist<-c("WR-H2.2",
               "WR-H2.3", 
               "WR-H2.4")
#Find associated timestamps with each respective FName label
        timelist<-c("2020-10-04 12:31:06",
                    "2020-10-04 12:32:28",
                    "2020-10-04 12:33:46.5")

#Sample ID, and associated row to 
maninput<-tibble(x=badlist,y=timelist)

I'm not particularly good in R, so the sad attempted pseudo-code below is trying to express my intent. Essentially, it needs to look up the FName variable under why it is grouped (and that's my biggest issue, is this using "quo"?), and then find the associated TIMESTAMP value to act as the variable for the filter.

Newdat<-dat%>%
    group_by(FName)%>%
    filter(TIMESTAMP>maninput[FName== "group_by(FName)",2]

I'm stumped, but feel the answer is pretty simple.

Upvotes: 0

Views: 36

Answers (1)

shizundeiku
shizundeiku

Reputation: 320

This is a prime use case for joins. You can easily solve this by left_joining your tibble with the maninput one, then filtering by comparing with the added column.

You didn't really provide any example data that can be used directly, so I made my own:

library(tidyverse)
library(lubridate)

set.seed(0)

max_time = ymd_hms("2020-11-02 11:07:00 CET")
names = c("D-H1", "WR-H2.2", "WR-H2.3", "WR-H2.4")
obs = 10

df = names %>%
  map(~tibble(FName = .x,
              Dispz = cumsum(abs(rnorm(obs)) * 0.3),
              Loadz = cumsum(abs(rnorm(obs)) * 10),
              TIMESTAMP = seq(max_time - obs, max_time - 1, length.out = obs),
              Failflag = "")) %>%
  bind_rows()

dat = df %>%
  group_by(FName) %>%
  filter(Dispz > 0.02) %>%
  filter(Loadz > 10) %>%
  mutate(across(where(is.numeric), ~.x - first(.x))) %>%
  slice(1:which.max(Loadz)) %>%
  ungroup()

badlist = c("WR-H2.2", "WR-H2.3")
timelist = c("2020-11-02 11:06:54 CET", "2020-11-02 11:06:57 CET")
maninput = tibble(x = badlist, y = ymd_hms(timelist))

Now you can do the join:

newdat = dat %>%
  group_by(FName) %>%
  left_join(maninput, by = c("FName" = "x")) %>%
  filter(is.na(y) | TIMESTAMP > y) %>%
  select(-y)

We first join on the maninput table as explained, creating an y column for every test where the FName matches an x value in maninput. Then we filter, selecting only the observations where either the y column is NA (i.e. there is no matching test in maninput), or the TIMESTAMP is greater than the y column (i.e. the observation occurred after the specified time in maninput). Then we remove the y column from the data as it is no longer needed.

This results in a tibble where the selected tests only have data after their respective timestamps, which I'm pretty sure is what you were looking for:

# A tibble: 23 x 5
# Groups:   FName [4]
   FName   Dispz Loadz TIMESTAMP           Failflag
   <chr>   <dbl> <dbl> <dttm>              <chr>   
 1 D-H1    0      0    2020-11-02 11:06:51 ""      
 2 D-H1    0.399 11.5  2020-11-02 11:06:52 ""      
(...)     
 9 D-H1    2.46  49.7  2020-11-02 11:06:59 ""      
10 WR-H2.2 0.409 25.3  2020-11-02 11:06:55 ""      
11 WR-H2.2 0.735 35.2  2020-11-02 11:06:56 ""      
12 WR-H2.2 0.942 39.5  2020-11-02 11:06:57 ""      
13 WR-H2.2 1.33  51.9  2020-11-02 11:06:58 ""      
14 WR-H2.2 1.34  54.7  2020-11-02 11:06:59 ""      
15 WR-H2.3 1.98  41.2  2020-11-02 11:06:58 ""      
16 WR-H2.3 2.05  53.8  2020-11-02 11:06:59 ""      
17 WR-H2.4 0      0    2020-11-02 11:06:53 ""      
18 WR-H2.4 0.244  6.64 2020-11-02 11:06:54 ""      
(...)
23 WR-H2.4 0.949 43.8  2020-11-02 11:06:59 ""       

Upvotes: 1

Related Questions