Reputation: 21400
I have a dataframe with f
requency counts of words and the number of non-NA
counts in column size
:
df <- structure(list(size = c(4, 5, 6, 7, 6, 6, 5, 4),
f1 = c(100L, 87L, 100L, 100L,100L, 100L, 100L, 100L),
f2 = c(97L, 88L, 96L, 93L, 99L, 100L, 88L, 105L),
f3 = c(99L, 77L, 96L, 66L, 95L, 88L, 87L, 80L),
f4 = c(77L, 67L, 66L, 30L, 27L, 39L, 82L, 80L),
f5 = c(NA, 61L, 70L, 29L, 22L, 33L, 44L, NA),
f6 = c(NA, NA, 18L, 90L, 9L, 11L, NA, NA),
f7 = c(NA, NA, NA, 24L, NA, NA, NA, NA)),
row.names = c(NA, -8L), class = "data.frame")
What I want to do is filter the dataframe for rows where the f
values continuously decrease. I'm able to do this with is.unsorted
in dplyr
like so:
df %>%
select(matches("f\\d")) %>%
filter(apply(., 1, function(x) !is.unsorted(rev(x), na.rm = T, strictly = FALSE)))
f1 f2 f3 f4 f5 f6 f7
1 100 99 95 27 22 9 NA
2 100 100 88 39 33 11 NA
3 100 88 87 82 44 NA NA
However, this method is too rigid as it requires that the decrease be steady across all columns. I'd like to take a more flexible approach by filtering for rows where only the first two f
values and the last two positive f
values grouped by size
must decrease.
How can this be done?
EXPECTED:
df
size f1 f2 f3 f4 f5 f6 f7
3 6 100 96 96 66 70 18 NA
4 7 100 93 66 30 29 90 24
5 6 100 99 95 27 22 9 NA
6 6 100 100 88 39 33 11 NA
7 5 100 88 87 82 44 NA NA
Upvotes: 1
Views: 155
Reputation: 26218
Another approach using purrr
and tidyr::nest
cond
column will give you TRUE
which fits in your critera and FALSE
which doesn't.df %>% rowwise() %>%
mutate(F1 = f1,
F2 = f2,
Fn_1 = get(paste0('f', size - 1)),
Fn = get(paste0('f', size))) %>%
nest(cond = c(F1, F2, Fn_1, Fn)) %>% ungroup() %>%
mutate(cond = map(cond, ~ .x %>%
pivot_longer(everything()) %>%
mutate(value = c(0, diff(value)) >0) %>%
summarise(d = !sum(value)) %>%
pull(d)
)) %>%
unnest(cond)
# A tibble: 8 x 9
size f1 f2 f3 f4 f5 f6 f7 cond
<dbl> <int> <int> <int> <int> <int> <int> <int> <lgl>
1 4 100 97 99 77 NA NA NA FALSE
2 5 87 88 77 67 61 NA NA FALSE
3 6 100 96 96 66 70 18 NA TRUE
4 7 100 93 66 30 29 90 24 TRUE
5 6 100 99 95 27 22 9 NA TRUE
6 6 100 100 88 39 33 11 NA TRUE
7 5 100 88 87 82 44 NA NA TRUE
8 4 100 105 80 80 NA NA NA FALSE
Earlier Answer I think you may do this with semi_join
in tidyverse
(Matches with your expected results)
df
has been pivoteddummy
col is always negative or 0
>0
and will therefore be filtered outrowids
where your condition is met have been kept.rowid
s have been used as a filter join (semi_join
)for original df
giving us the rows we need.df %>%
mutate(rowid = row_number()) %>%
semi_join(df %>%
mutate(rowid = row_number()) %>%
pivot_longer(starts_with('f'), values_drop_na = T) %>%
group_by(rowid) %>%
slice(unique(1:2, (n()-1):n())) %>%
mutate(dummy = c(0, diff(value)) > 0) %>%
summarise(sum = sum(dummy)) %>%
filter(sum == 0) %>%
select(rowid), by = 'rowid') %>%
select(-rowid)
size f1 f2 f3 f4 f5 f6 f7
1 6 100 96 96 66 70 18 NA
2 7 100 93 66 30 29 90 24
3 6 100 99 95 27 22 9 NA
4 6 100 100 88 39 33 11 NA
5 5 100 88 87 82 44 NA NA
Upvotes: 2
Reputation: 17648
You can try a pmap
& map
approach
pmap(df, ~na.omit(c(...)) %>% .[unique(c(2:3, length(.)-1,length(.)))] %>% diff(.) <= 0) %>%
map_lgl(all) %>%
df[.,]
size f1 f2 f3 f4 f5 f6 f7
3 6 100 96 96 66 70 18 NA
4 7 100 93 66 30 29 90 24
5 6 100 99 95 27 22 9 NA
6 6 100 100 88 39 33 11 NA
7 5 100 88 87 82 44 NA NA
Within pmap
the NA
s are removed, then the first two and the last two columns are selected and lastly the difference is checked to be less or equal then zero. The map function checks if the condition is TRUE
across all values. Finally df
is subsetted.
Only trailing NA
s can be removed using zoo::na.trim(c(...), sides = "right")
Upvotes: 2
Reputation: 21908
Dedicated to my friend @AnilGoyal
You can also use this approach, it may sound a bit complicated but I tried to propose a different & flexible solution. It took me quite a while to understand what is exactly the conditions to be applied but thanks to dear @AnilGoyal I finally found out.
library(dplyr)
library(purrr)
df %>%
mutate(dec = pmap(df %>%
select(!size), ~ diff(c(...)[!is.na(c(...))][1:2]) <= 0 &
diff(rev(c(...)[!is.na(c(...))][2:1])) <= 0 &
diff(c(...)[!is.na(c(...))][c(2, length(c(...)[!is.na(c(...))]) - 1)]) <= 0)) %>%
filter(dec == TRUE) %>%
select(-dec)
size f1 f2 f3 f4 f5 f6 f7
1 6 100 96 96 66 70 18 NA
2 7 100 93 66 30 29 90 24
3 6 100 99 95 27 22 9 NA
4 6 100 100 88 39 33 11 NA
5 5 100 88 87 82 44 NA NA
First & second logical expression will check if first two values as well as last two non_NA values are in decreasing order at the same time. While the third one will check whether the second element and second to last element are also in decreasing order.
Upvotes: 2