Let's say I have the following dataframe in R:
set.seed(23)
# Create sample data
time = 1:15
x = rnorm(n = 15)
y = rnorm(n = 15)
boolean = sample(c(TRUE,FALSE), 15, TRUE)
df <- data.frame(time, x, y, boolean)
# Output
> df
time x y boolean
1 1 0.19321233 0.308136896 TRUE
2 2 -0.43468211 -0.520178315 TRUE
3 3 0.91326710 -0.442313801 FALSE # select
4 4 1.79338809 -0.599312812 TRUE # select
5 5 0.99660511 1.294577829 TRUE
6 6 1.10749049 0.835391247 TRUE
7 7 -0.27808628 -0.566015100 TRUE
8 8 1.01920549 0.788419350 FALSE # select
9 9 0.04543718 -1.165929326 TRUE # select
10 10 1.57577959 -0.530820006 FALSE # select
11 11 0.21828845 -0.001058737 FALSE
12 12 -1.04653534 -0.512562365 FALSE
13 13 -0.28868865 1.242867513 FALSE
14 14 0.48155029 -0.660582851 FALSE
15 15 -1.21637643 0.166624215 TRUE # select
Problem
I would like to select all rows in which the boolean in the 4th column switches from FALSE to TRUE or vice versa (marked with # select in the dataframe above).
Question
How do I do this in R?
Attempt
I have found the select() and select_if() functions in the tidyverse package; however, I am not able to select rows based on the previous value in the column.
Upvotes: 1
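Using the helper function shift() from the data.table package (and the correct data provided by Ronak):
# shift() lags boolean by one row; fill = boolean[1] keeps row 1 from counting as a change
subset(df, boolean != data.table::shift(boolean, fill = boolean[1]))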
time x y boolean
2 2 -0.43468211 -0.566015100 TRUE
3 3 0.91326710 0.788419350 FALSE
6 6 1.10749049 -0.001058737 TRUE
8 8 1.01920549 1.242867513 FALSE
9 9 0.04543718 -0.660582851 TRUE
13 13 -0.28868865 -1.146665860 FALSE
15 15 -1.21637643 -0.202111683 TRUE
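To see what shift() contributes here, the same comparison can be sketched in base R. This is only an illustration on a small made-up logical vector (b, lagged and changed are hypothetical names, not from the answer), and it assumes the default behaviour of shift(), which is a lag of one position.

```r
# Small made-up logical vector; not the question's data
b <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Base R stand-in for shift(b, fill = b[1]):
# move every value one position down and pad the front with the first value
lagged <- c(b[1], b[-length(b)])

# TRUE wherever a value differs from the one before it
changed <- b != lagged

# Positions at which the value switches
which(changed)
```

Because the first element is compared with itself, position 1 can never be flagged as a change.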
Upvotes: 1
Here's another base R solution:
df[c(FALSE, diff(df$boolean) != 0), ]
time x y boolean
2 2 -0.43468211 -0.566015100 TRUE
3 3 0.91326710 0.788419350 FALSE
6 6 1.10749049 -0.001058737 TRUE
8 8 1.01920549 1.242867513 FALSE
9 9 0.04543718 -0.660582851 TRUE
13 13 -0.28868865 -1.146665860 FALSE
15 15 -1.21637643 -0.202111683 TRUE
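This relies on diff() treating TRUE and FALSE as 1 and 0. Wherever the value changes, the difference between consecutive elements is either -1 or 1; otherwise it is 0. The leading FALSE pads the result back to the full number of rows, so the first row is never selected.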
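The intermediate values may make this easier to follow. Below is a minimal sketch on a small made-up vector (b, d and keep are hypothetical names used only for illustration):

```r
# Small made-up logical vector; not the question's data
b <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# diff() coerces logicals to integers and subtracts neighbours,
# so the result is one element shorter than b
d <- diff(b)

# Prepend FALSE for the first element, which has no predecessor
keep <- c(FALSE, d != 0)

# Positions at which the value switches
which(keep)
```

Here d contains 1 for a FALSE-to-TRUE switch, -1 for a TRUE-to-FALSE switch, and 0 where nothing changes.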
Upvotes: 2
We can use rle to create a counter that increments at every change in the boolean value. We then use duplicated to keep only the first row for each counter value. This also selects the first row of the data, but since that is not an actual change in the boolean value, we remove it (using [-1, ]).
df[!duplicated(with(rle(df$boolean), rep(seq_along(values), lengths))), ][-1, ]
# time x y boolean
#2 2 -0.43468211 -0.566015100 TRUE
#3 3 0.91326710 0.788419350 FALSE
#6 6 1.10749049 -0.001058737 TRUE
#8 8 1.01920549 1.242867513 FALSE
#9 9 0.04543718 -0.660582851 TRUE
#13 13 -0.28868865 -1.146665860 FALSE
#15 15 -1.21637643 -0.202111683 TRUE
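Broken into steps, the one-liner above does roughly the following. This is a sketch on a small made-up vector; runs, counter and first_of_run are hypothetical names introduced for illustration:

```r
# Small made-up logical vector; not the question's data
b <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Run-length encoding: one entry per run of identical values
runs <- rle(b)

# Expand the run index back to full length, giving a counter
# that increases by one each time the value changes
counter <- rep(seq_along(runs$values), runs$lengths)

# The first element of each run is the first occurrence of its counter value
first_of_run <- !duplicated(counter)

# Drop the first run start, which is position 1 and not a real change
which(first_of_run)[-1]
```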
The same logic can be applied using data.table::rleid, which makes it a bit shorter:
df[!duplicated(data.table::rleid(df$boolean)), ][-1, ]
In dplyr, we can create groups using lag and cumsum and select the first row of every group.
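library(dplyr)
df %>%
  # start a new group each time boolean differs from the previous row
  group_by(group = cumsum(boolean != lag(boolean, default = first(boolean)))) %>%
  # keep the first row of each group
  slice(1L) %>%
  ungroup() %>%
  # the first group starts at row 1, which is not an actual change
  slice(-1L) %>%
  select(-group)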
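The grouping column is the key step, so here is a rough base R sketch of how it is built and used. It runs on a small made-up vector; prev, group and first_rows are hypothetical names, and prev is a hand-written stand-in for lag(boolean, default = first(boolean)) rather than the dplyr function itself.

```r
# Small made-up logical vector; not the question's data
b <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Stand-in for lag(b, default = first(b)): previous value, padded with b[1]
prev <- c(b[1], b[-length(b)])

# Each TRUE in (b != prev) marks a switch; cumsum turns the marks into group ids
group <- cumsum(b != prev)

# First position within each group, analogous to group_by() followed by slice(1L)
first_rows <- as.vector(tapply(seq_along(b), group, min))

# Drop the first group's start, analogous to slice(-1L)
first_rows[-1]
```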
data
df <- structure(list(time = 1:15, x = c(0.19321233, -0.43468211, 0.9132671,
1.79338809, 0.99660511, 1.10749049, -0.27808628, 1.01920549,
0.04543718, 1.57577959, 0.21828845, -1.04653534, -0.28868865,
0.48155029, -1.21637643), y = c(0.835391247, -0.5660151, 0.78841935,
-1.165929326, -0.530820006, -0.001058737, -0.512562365, 1.242867513,
-0.660582851, 0.166624215, -0.55320524, 0.098181415, -1.14666586,
-1.249927257, -0.202111683), boolean = c(FALSE, TRUE, FALSE,
FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,
FALSE, TRUE)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14","15"))
Upvotes: 2