Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

Filter rows flexibly by first and last value of grouping variable

I have a dataframe with frequency counts of words and the number of non-NA counts in column size:

df <- structure(list(size = c(4, 5, 6, 7, 6, 6, 5, 4), 
                     f1 = c(100L, 87L, 100L, 100L,100L, 100L, 100L, 100L), 
                     f2 = c(97L, 88L, 96L, 93L, 99L, 100L, 88L, 105L), 
                     f3 = c(99L, 77L, 96L, 66L, 95L, 88L, 87L, 80L), 
                     f4 = c(77L, 67L, 66L, 30L, 27L, 39L, 82L, 80L),
                     f5 = c(NA, 61L, 70L, 29L, 22L, 33L, 44L, NA),
                     f6 = c(NA, NA, 18L, 90L, 9L, 11L, NA, NA),
                     f7 = c(NA, NA, NA, 24L, NA, NA, NA, NA)),
                row.names = c(NA, -8L), class = "data.frame")

What I want to do is filter the dataframe for rows where the fvalues continuously decrease. I'm able to do this with is.unsorted in dplyrlike so:

df %>%
  select(matches("f\\d")) %>%
  filter(apply(., 1, function(x) !is.unsorted(rev(x), na.rm = T, strictly = FALSE)))
   f1  f2 f3 f4 f5 f6 f7
1 100  99 95 27 22  9 NA
2 100 100 88 39 33 11 NA
3 100  88 87 82 44 NA NA

However, this method is too rigid as it requires that the decrease be steady across all columns. I'd like to take a more flexible approach by filtering for rows where only the first two f values and the last two positive f values grouped by size must decrease.

How can this be done?

EXPECTED:

df 
  size  f1  f2 f3 f4 f5 f6 f7
3    6 100  96 96 66 70 18 NA
4    7 100  93 66 30 29 90 24
5    6 100  99 95 27 22  9 NA
6    6 100 100 88 39 33 11 NA
7    5 100  88 87 82 44 NA NA

Upvotes: 1

Views: 155

Answers (3)

AnilGoyal
AnilGoyal

Reputation: 26218

Another approach using purrr and tidyr::nest

  • Here cond column will give you TRUE which fits in your critera and FALSE which doesn't.
df %>% rowwise() %>%
  mutate(F1 = f1,
         F2 = f2,
         Fn_1 = get(paste0('f', size - 1)),
         Fn = get(paste0('f', size))) %>%
  nest(cond = c(F1, F2, Fn_1, Fn)) %>% ungroup() %>%
  mutate(cond = map(cond, ~ .x %>% 
                      pivot_longer(everything()) %>%
                      mutate(value = c(0, diff(value)) >0) %>%
                      summarise(d = !sum(value)) %>%
                      pull(d)
                    )) %>%
  unnest(cond)

# A tibble: 8 x 9
   size    f1    f2    f3    f4    f5    f6    f7 cond 
  <dbl> <int> <int> <int> <int> <int> <int> <int> <lgl>
1     4   100    97    99    77    NA    NA    NA FALSE
2     5    87    88    77    67    61    NA    NA FALSE
3     6   100    96    96    66    70    18    NA TRUE 
4     7   100    93    66    30    29    90    24 TRUE 
5     6   100    99    95    27    22     9    NA TRUE 
6     6   100   100    88    39    33    11    NA TRUE 
7     5   100    88    87    82    44    NA    NA TRUE 
8     4   100   105    80    80    NA    NA    NA FALSE

Earlier Answer I think you may do this with semi_join in tidyverse (Matches with your expected results)

  • First expression inside parenthesis need to be seen
    • df has been pivoted
    • for every group (row) only four rows are kept and rest are filtered out
    • Differences checked whether dummy col is always negative or 0
    • If at least one TRUE in any group, that will sum(dummy) in that group >0 and will therefore be filtered out
    • the rowids where your condition is met have been kept.
  • Now the above rowids have been used as a filter join (semi_join)for original df giving us the rows we need.
df %>% 
  mutate(rowid = row_number()) %>%
  semi_join(df %>% 
              mutate(rowid = row_number()) %>%
              pivot_longer(starts_with('f'), values_drop_na = T) %>%
              group_by(rowid) %>%
              slice(unique(1:2, (n()-1):n())) %>%
              mutate(dummy = c(0, diff(value)) > 0) %>%
              summarise(sum = sum(dummy)) %>%
              filter(sum == 0) %>%
              select(rowid), by = 'rowid') %>%
  select(-rowid)
  size  f1  f2 f3 f4 f5 f6 f7
1    6 100  96 96 66 70 18 NA
2    7 100  93 66 30 29 90 24
3    6 100  99 95 27 22  9 NA
4    6 100 100 88 39 33 11 NA
5    5 100  88 87 82 44 NA NA

Upvotes: 2

Roman
Roman

Reputation: 17648

You can try a pmap & map approach

 pmap(df, ~na.omit(c(...)) %>% .[unique(c(2:3, length(.)-1,length(.)))] %>% diff(.) <= 0) %>% 
 map_lgl(all) %>% 
 df[.,]
  size  f1  f2 f3 f4 f5 f6 f7
3    6 100  96 96 66 70 18 NA
4    7 100  93 66 30 29 90 24
5    6 100  99 95 27 22  9 NA
6    6 100 100 88 39 33 11 NA
7    5 100  88 87 82 44 NA NA

Within pmap the NAs are removed, then the first two and the last two columns are selected and lastly the difference is checked to be less or equal then zero. The map function checks if the condition is TRUE across all values. Finally df is subsetted.

Only trailing NAs can be removed using zoo::na.trim(c(...), sides = "right")

Upvotes: 2

Anoushiravan R
Anoushiravan R

Reputation: 21908

Dedicated to my friend @AnilGoyal

You can also use this approach, it may sound a bit complicated but I tried to propose a different & flexible solution. It took me quite a while to understand what is exactly the conditions to be applied but thanks to dear @AnilGoyal I finally found out.

library(dplyr)
library(purrr)

df %>% 
  mutate(dec = pmap(df %>% 
                      select(!size), ~ diff(c(...)[!is.na(c(...))][1:2]) <= 0 & 
                      diff(rev(c(...)[!is.na(c(...))][2:1])) <= 0 & 
                      diff(c(...)[!is.na(c(...))][c(2, length(c(...)[!is.na(c(...))]) - 1)]) <= 0)) %>%
  filter(dec == TRUE) %>%
  select(-dec)

  size  f1  f2 f3 f4 f5 f6 f7
1    6 100  96 96 66 70 18 NA
2    7 100  93 66 30 29 90 24
3    6 100  99 95 27 22  9 NA
4    6 100 100 88 39 33 11 NA
5    5 100  88 87 82 44 NA NA

First & second logical expression will check if first two values as well as last two non_NA values are in decreasing order at the same time. While the third one will check whether the second element and second to last element are also in decreasing order.

Upvotes: 2

Related Questions