Reputation: 2126

Select rows in dataframe conditional on a switching boolean variable in another column in R

Let's say I have the following dataframe in R:

set.seed(23)

# Create sample data
time = 1:15
x = rnorm(n = 15) 
y = rnorm(n = 15)
boolean = sample(c(TRUE,FALSE), 15, TRUE)
df <- data.frame(time, x, y, boolean)

# Output
> df
   time           x            y boolean
1     1  0.19321233  0.308136896    TRUE
2     2 -0.43468211 -0.520178315    TRUE
3     3  0.91326710 -0.442313801   FALSE # select
4     4  1.79338809 -0.599312812    TRUE # select
5     5  0.99660511  1.294577829    TRUE
6     6  1.10749049  0.835391247    TRUE
7     7 -0.27808628 -0.566015100    TRUE
8     8  1.01920549  0.788419350   FALSE # select
9     9  0.04543718 -1.165929326    TRUE # select
10   10  1.57577959 -0.530820006   FALSE # select
11   11  0.21828845 -0.001058737   FALSE
12   12 -1.04653534 -0.512562365   FALSE
13   13 -0.28868865  1.242867513   FALSE
14   14  0.48155029 -0.660582851   FALSE
15   15 -1.21637643  0.166624215    TRUE # select

Problem

I would like to select all rows in which the boolean in the 4th column switches from FALSE to TRUE or vice versa (marked with # select in the dataframe above).

Question

How do I do this in R?

Attempt

I have found the select() and select_if() functions in the tidyverse; however, I am not able to select rows based on the previous value in the column.

Upvotes: 1

Answers (3)

s_baldur

Reputation: 33603

Using the helper function shift() from the data.table package (and the correct data provided by Ronak):

subset(df, boolean != data.table::shift(boolean, fill = boolean[1]))

   time           x            y boolean
2     2 -0.43468211 -0.566015100    TRUE
3     3  0.91326710  0.788419350   FALSE
6     6  1.10749049 -0.001058737    TRUE
8     8  1.01920549  1.242867513   FALSE
9     9  0.04543718 -0.660582851    TRUE
13   13 -0.28868865 -1.146665860   FALSE
15   15 -1.21637643 -0.202111683    TRUE
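
For readers without data.table, the lag that shift() produces here can be approximated in base R. The following is only a sketch on a made-up vector (b is not the question's data, and prev and switched are illustrative names), assuming a lag of one position with the first slot filled by the first value:

```r
# Toy logical vector (illustrative, not the question's data)
b <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Lag by one position; fill the first slot with b[1] so that
# the first element can never count as a switch
prev <- c(b[1], head(b, -1))

# TRUE wherever a value differs from the one before it
switched <- b != prev
which(switched)
# 2 4 6
```

The same comparison, applied to df$boolean, should give the row index used by subset() above.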

Upvotes: 1

Cole

Reputation: 11255

Here's another base solution:

df[c(FALSE, diff(df$boolean) != 0), ]

   time           x            y boolean
2     2 -0.43468211 -0.566015100    TRUE
3     3  0.91326710  0.788419350   FALSE
6     6  1.10749049 -0.001058737    TRUE
8     8  1.01920549  1.242867513   FALSE
9     9  0.04543718 -0.660582851    TRUE
13   13 -0.28868865 -1.146665860   FALSE
15   15 -1.21637643 -0.202111683    TRUE

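This relies on taking the difference between consecutive values, with TRUE and FALSE coerced to 1 and 0. Wherever the value changes, the difference is either -1 or 1; otherwise it is 0. Because diff() returns one element fewer than its input, a leading FALSE is prepended so the index lines up with the rows.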
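
To make the coercion concrete, here is a small illustration on a made-up vector (b, d and keep are names chosen for this sketch only):

```r
# Toy logical vector (illustrative, not the question's data)
b <- c(TRUE, TRUE, FALSE, TRUE, TRUE)

# Logicals are treated as 1/0, so diff() yields 0 where nothing
# changes and -1 or 1 where the value flips
d <- diff(b)
# 0 -1 1 0

# diff() is one element shorter than b, so pad with a leading FALSE
keep <- c(FALSE, d != 0)
which(keep)
# 3 4
```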

Upvotes: 2

Ronak Shah

Reputation: 389225

We can use rle to create a counter that increments at every change in the boolean value. We then use duplicated to keep only the first row for each counter value. This also selects the first row of the data frame, but since that is not an actual change in boolean value, we remove it (by using [-1]).

df[!duplicated(with(rle(df$boolean), rep(seq_along(values), lengths))), ][-1, ]

#   time           x            y boolean
#2     2 -0.43468211 -0.566015100    TRUE
#3     3  0.91326710  0.788419350   FALSE
#6     6  1.10749049 -0.001058737    TRUE
#8     8  1.01920549  1.242867513   FALSE
#9     9  0.04543718 -0.660582851    TRUE
#13   13 -0.28868865 -1.146665860   FALSE
#15   15 -1.21637643 -0.202111683    TRUE
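
The intermediate steps may be easier to follow on a short made-up vector. This is only a sketch; b, r, run_id and first_of_run are illustrative names, not part of the answer's code:

```r
# Toy logical vector (illustrative, not the question's data)
b <- c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)

r <- rle(b)
# r$values:  FALSE TRUE FALSE TRUE
# r$lengths: 1 1 2 2

# Expand the run number back to one entry per element
run_id <- rep(seq_along(r$values), r$lengths)
# 1 2 3 3 4 4

# The first element of each run is the one that is not a duplicate
first_of_run <- !duplicated(run_id)

# Drop the start of the first run, which is not an actual switch
which(first_of_run)[-1]
# 2 3 5
```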

The same logic can be applied using data.table::rleid, which makes it a bit shorter:

df[!duplicated(data.table::rleid(df$boolean)), ][-1, ]

In dplyr, we can create groups using lag and cumsum and select the first row of every group:

library(dplyr)
df %>%
  group_by(group = cumsum(boolean != lag(boolean, default = first(boolean)))) %>%
  slice(1L) %>%
  ungroup() %>%
  slice(-1L) %>%
  select(-group)

data

df <- structure(list(time = 1:15, x = c(0.19321233, -0.43468211, 0.9132671, 
1.79338809, 0.99660511, 1.10749049, -0.27808628, 1.01920549, 
0.04543718, 1.57577959, 0.21828845, -1.04653534, -0.28868865, 
0.48155029, -1.21637643), y = c(0.835391247, -0.5660151, 0.78841935, 
-1.165929326, -0.530820006, -0.001058737, -0.512562365, 1.242867513, 
-0.660582851, 0.166624215, -0.55320524, 0.098181415, -1.14666586, 
-1.249927257, -0.202111683), boolean = c(FALSE, TRUE, FALSE, 
FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, 
FALSE, TRUE)), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14","15"))

Upvotes: 2
