Reputation: 23
I am working with a dataset of animal behaviors, and am trying to create a new column ("environment") based on conditions fulfilled in another row. Specifically, I want the new column to return "water" if the behavior falls between the start/stop times of the behavior "o_water", and "land" if it falls outside these bounds. If this is unclear here is a minimal example:
library(dplyr)
library(magrittr)
otters <- data.frame(
observation_id = 1,
subject = 1,
behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
start_time = c(1,1,2,6,6,11,11),
stop_time = c(5,3,4,10,9,15,14)
)
#this does it, but manually. need to go over very large dataset and search for conditions
otters <- otters %>%
group_by(subject, observation_id, behavior) %>%
mutate(environment = ifelse((start_time >= 1 & stop_time <= 5) |
(start_time >= 11 & stop_time <= 15), "water", "land"))
This is the output desired.
Groups: subject, observation_id, behavior [5]
observation_id subject behavior start_time stop_time environment
<dbl> <dbl> <fct> <dbl> <dbl> <chr>
1 1 1 o_water 1 5 water
2 1 1 swim 1 3 water
3 1 1 float 2 4 water
4 1 1 o_land 6 10 land
5 1 1 walk 6 9 land
6 1 1 o_water 11 15 water
7 1 1 float 11 14 water
The second set of commands is sort of what I want, but I need this to search out and apply it to an entire dataset rather than typing out each parameter. The grouping is so the functions are performed over the applicable rows; in the full dataset, there are multiple subjects and observation_id's.
I've tried using when() and case_when() to no avail, but I am very novice level at R so would appreciate any help!
Apologies for any missteps I've done. I haven't been able to find a problem quite like this elsewhere on stackoverflow.
Upvotes: 2
Views: 1263
Reputation: 4524
Here is another approach, in which I extract all the start_
and stop_times
of each behavior of o_water
into a list. As there are 2 entries for o_water
these lists have two elements.
With the help of purrr
I create a list pu
of vectors which span the start_
and stop_time
for each entry.
purrr
also helps to find which row in otters
falls within the sequences defined in pu
. l_true
is the sum of all columns which have a FALSE/TRUE, it contains either a 1, if there is one entry matching the time sequence or a 0 if it does not.
This list is bound to otters
and with ifelse
the new column is named.
library(tidyverse)
otters <- data.frame(
observation_id = 1,
subject = 1,
behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
start_time = c(1,1,2,6,6,11,11),
stop_time = c(5,3,4,10,9,15,14)
)
# otters
# find all start_times of 'o_water'
otters %>%
dplyr::filter(grepl('water', behavior)) %>%
select(ends_with('time')) %>%
`[[`(1) -> start
start
#> [1] 1 11
# find all stop_times
otters %>%
dplyr::filter(grepl('water', behavior)) %>%
select(ends_with('time')) %>%
`[[`(2) -> stop
stop
#> [1] 5 15
# bring start and stop_times together in one
# list per 'o_water'
pu <- purrr::map2(start, stop, ~ .x : .y)
pu
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 11 12 13 14 15
# check with pu, if start_ and stop_time of each row is
# in pu, and combine the row of FALSE/TRUE into a list
l_true <- map_dfc(pu, ~ otters$start_time %in% . & otters$stop_time %in% .) %>%
mutate(l = rowSums(.[1:ncol(.)])) %>% `[[`(ncol(.))
otters %>%
cbind(., l_true) %>%
mutate(ev = ifelse(l_true == 1, 'water', 'land'))
#> observation_id subject behavior start_time stop_time l_true ev
#> 1 1 1 o_water 1 5 1 water
#> 2 1 1 swim 1 3 1 water
#> 3 1 1 float 2 4 1 water
#> 4 1 1 o_land 6 10 0 land
#> 5 1 1 walk 6 9 0 land
#> 6 1 1 o_water 11 15 1 water
#> 7 1 1 float 11 14 1 water
Upvotes: 1
Reputation: 30474
Here is another approach with dplyr
that also uses fuzzyjoin
package.
You can separate your o_water
behavior rows from otters
and designate the environment
as water
.
Then, with fuzzy_left_join
, merge the o_water
rows with the rest of your data, where the start_time
and end_time
fall between the o_water
range.
The remaining NA
in environment
will be non-merged rows, which can be land
or other designation.
library(dplyr)
library(fuzzyjoin)
otters_water <- otters %>%
filter(behavior == "o_water") %>%
mutate(environment = "water") %>%
select(-behavior)
otters %>%
fuzzy_left_join(otters_water,
by = c("subject", "observation_id", "start_time", "stop_time"),
match_fun = list(`==`, `==`, `>=`, `<=`)) %>%
replace_na(list(environment = "land")) %>%
select(c(observation_id.x:stop_time.x, environment))
Output
observation_id.x subject.x behavior start_time.x stop_time.x environment
1 1 1 o_water 1 5 water
2 1 1 swim 1 3 water
3 1 1 float 2 4 water
4 1 1 o_land 6 10 land
5 1 1 walk 6 9 land
6 1 1 o_water 11 15 water
7 1 1 float 11 14 water
Upvotes: 2
Reputation: 76
Great job on your question.
I think rearranging your dataset will help a lot here. I'd suggest rearranging it so that each time point has only one record (per individual otter, perhaps), and individual behaviors each have their own column, with binary data indicating whether or not that behavior is occurring at each time point.
There's a lot of rearranging that happens in the first few lines; I'd suggest stepping through the code one line at a time just to see how each line moves the data around.
Using the data you provided:
library(tidyverse)
otters_wide <- otters %>%
# first pivot to a longer form, so the time values are all in one column
pivot_longer(cols = c("start_time", "stop_time"), names_to = "start_stop", values_to = "time", names_pattern = "(.*)_time") %>%
# then pivot to w wider format, so each behavior has its own column.
pivot_wider(names_from = "behavior", values_from = "start_stop") %>%
#Then arrange everything in order of time.
arrange(time) %>%
#Fill behavior columns downward
fill(o_water, swim, float, o_land, walk) %>%
#change all "start"s and the first "stop" in each series to "yes", and all other "stop"s to "no"
mutate_at(.vars = c("o_water", "swim", "float", "o_land", "walk"), ~ if_else(. == "start" | lag(.) == "start", "yes", "no")) %>%
# this column is a little redundant now, but here's the water/land column, at last.
mutate(environment = if_else(o_water == "yes", "water", "land"))
otters_wide
# A tibble: 11 x 9
observation_id subject time o_water swim float o_land walk environment
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 1 yes yes NA NA NA water
2 1 1 2 yes yes yes NA NA water
3 1 1 3 yes yes yes NA NA water
4 1 1 4 yes no yes NA NA water
5 1 1 5 yes no no NA NA water
6 1 1 6 no no no yes yes land
7 1 1 9 no no no yes yes land
8 1 1 10 no no no yes no land
9 1 1 11 yes no yes no no water
10 1 1 14 yes no yes no no water
11 1 1 15 yes no no no no water
In this alternate format you could go even further and include time points that weren't explicitly recorded (e.g. times 7, 8, 12 & 13, in this example) which, once filled in, would make summarizing things like total time spent on each behavior much more straightforward.
Since the behavioral data is binary, those columns could certainly contain logical data rather than character data, but because they started as character data, keeping them that way seemed simplest.
Hope this helps.
Upvotes: 3