hbungum
hbungum

Reputation: 23

mutate new column based on conditions in another row in R

I am working with a dataset of animal behaviors, and am trying to create a new column ("environment") based on conditions fulfilled in another row. Specifically, I want the new column to return "water" if the behavior falls between the start/stop times of the behavior "o_water", and "land" if it falls outside these bounds. If this is unclear here is a minimal example:

library(dplyr) 
library(magrittr)

otters <- data.frame(
  observation_id = 1,
  subject = 1,
  behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
  start_time = c(1,1,2,6,6,11,11),
  stop_time = c(5,3,4,10,9,15,14)
)

#this does it, but manually. need to go over very large dataset and search for conditions
otters <- otters %>%
  group_by(subject, observation_id, behavior) %>%
  mutate(environment = ifelse((start_time >= 1 & stop_time <= 5) | 
                                (start_time >= 11 & stop_time <= 15), "water", "land"))

This is the output desired.

Groups:   subject, observation_id, behavior [5]
  observation_id subject behavior start_time stop_time environment
           <dbl>   <dbl> <fct>         <dbl>     <dbl> <chr>      
1              1       1 o_water           1         5 water      
2              1       1 swim              1         3 water      
3              1       1 float             2         4 water      
4              1       1 o_land            6        10 land       
5              1       1 walk              6         9 land       
6              1       1 o_water          11        15 water      
7              1       1 float            11        14 water

The second set of commands is sort of what I want, but I need this to search out and apply it to an entire dataset rather than typing out each parameter. The grouping is so the functions are performed over the applicable rows; in the full dataset, there are multiple subjects and observation_id's.

I've tried using when() and case_when() to no avail, but I am very novice level at R so would appreciate any help!

Apologies for any missteps I've done. I haven't been able to find a problem quite like this elsewhere on stackoverflow.

Upvotes: 2

Views: 1263

Answers (3)

MarBlo
MarBlo

Reputation: 4524

Here is another approach, in which I extract all the start_ and stop_times of each behavior of o_water into a list. As there are 2 entries for o_water these lists have two elements. With the help of purrr I create a list pu of vectors which span the start_ and stop_time for each entry.

purrr also helps to find which row in otters falls within the sequences defined in pu. l_true is the sum of all columns which have a FALSE/TRUE, it contains either a 1, if there is one entry matching the time sequence or a 0 if it does not.

This list is bound to otters and with ifelse the new column is named.

library(tidyverse)

otters <- data.frame(
  observation_id = 1,
  subject = 1,
  behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
  start_time = c(1,1,2,6,6,11,11),
  stop_time = c(5,3,4,10,9,15,14)
)
# otters
# find all start_times of 'o_water'
otters %>%
  dplyr::filter(grepl('water', behavior)) %>% 
  select(ends_with('time')) %>% 
  `[[`(1) -> start
start
#> [1]  1 11

# find all stop_times
otters %>%
  dplyr::filter(grepl('water', behavior)) %>% 
  select(ends_with('time')) %>% 
  `[[`(2) -> stop
stop
#> [1]  5 15

# bring start and stop_times together in one
# list per 'o_water'
pu <- purrr::map2(start, stop, ~ .x : .y)
pu
#> [[1]]
#> [1] 1 2 3 4 5
#> 
#> [[2]]
#> [1] 11 12 13 14 15

# check with pu, if start_ and stop_time of each row is
# in pu, and combine the row of FALSE/TRUE into a list
l_true <- map_dfc(pu, ~ otters$start_time %in% . & otters$stop_time %in% .) %>% 
  mutate(l = rowSums(.[1:ncol(.)])) %>% `[[`(ncol(.))


otters %>% 
  cbind(., l_true) %>% 
  mutate(ev = ifelse(l_true == 1, 'water', 'land'))
#>   observation_id subject behavior start_time stop_time l_true    ev
#> 1              1       1  o_water          1         5      1 water
#> 2              1       1     swim          1         3      1 water
#> 3              1       1    float          2         4      1 water
#> 4              1       1   o_land          6        10      0  land
#> 5              1       1     walk          6         9      0  land
#> 6              1       1  o_water         11        15      1 water
#> 7              1       1    float         11        14      1 water

Upvotes: 1

Ben
Ben

Reputation: 30474

Here is another approach with dplyr that also uses fuzzyjoin package.

You can separate your o_water behavior rows from otters and designate the environment as water.

Then, with fuzzy_left_join, merge the o_water rows with the rest of your data, where the start_time and end_time fall between the o_water range.

The remaining NA in environment will be non-merged rows, which can be land or other designation.

library(dplyr)
library(fuzzyjoin)

otters_water <- otters %>%
  filter(behavior == "o_water") %>%
  mutate(environment = "water") %>%
  select(-behavior)

otters %>%
  fuzzy_left_join(otters_water, 
                  by = c("subject", "observation_id", "start_time", "stop_time"), 
                  match_fun = list(`==`, `==`, `>=`, `<=`)) %>%
  replace_na(list(environment = "land")) %>%
  select(c(observation_id.x:stop_time.x, environment))

Output

  observation_id.x subject.x behavior start_time.x stop_time.x environment
1                1         1  o_water            1           5       water
2                1         1     swim            1           3       water
3                1         1    float            2           4       water
4                1         1   o_land            6          10        land
5                1         1     walk            6           9        land
6                1         1  o_water           11          15       water
7                1         1    float           11          14       water

Upvotes: 2

stribstrib
stribstrib

Reputation: 76

Great job on your question.

I think rearranging your dataset will help a lot here. I'd suggest rearranging it so that each time point has only one record (per individual otter, perhaps), and individual behaviors each have their own column, with binary data indicating whether or not that behavior is occurring at each time point.

There's a lot of rearranging that happens in the first few lines; I'd suggest stepping through the code one line at a time just to see how each line moves the data around.

Using the data you provided:

library(tidyverse)

otters_wide <- otters %>% 
  # first pivot to a longer form, so the time values are all in one column
  pivot_longer(cols = c("start_time", "stop_time"), names_to = "start_stop", values_to = "time", names_pattern = "(.*)_time") %>%
  # then pivot to w wider format, so each behavior has its own column. 
  pivot_wider(names_from = "behavior", values_from = "start_stop") %>%
  #Then arrange everything in order of time.
  arrange(time) %>%
  #Fill behavior columns downward
  fill(o_water, swim, float, o_land, walk) %>%
  #change all "start"s and the first "stop" in each series to "yes", and all other "stop"s to "no"
  mutate_at(.vars = c("o_water", "swim", "float", "o_land", "walk"), ~ if_else(. == "start" | lag(.) == "start", "yes", "no")) %>% 
  # this column is a little redundant now, but here's the water/land column, at last.
  mutate(environment = if_else(o_water == "yes", "water", "land"))
otters_wide

# A tibble: 11 x 9
   observation_id subject  time o_water swim  float o_land walk  environment
            <dbl>   <dbl> <dbl> <chr>   <chr> <chr> <chr>  <chr> <chr>      
 1              1       1     1 yes     yes   NA    NA     NA    water      
 2              1       1     2 yes     yes   yes   NA     NA    water      
 3              1       1     3 yes     yes   yes   NA     NA    water      
 4              1       1     4 yes     no    yes   NA     NA    water      
 5              1       1     5 yes     no    no    NA     NA    water      
 6              1       1     6 no      no    no    yes    yes   land       
 7              1       1     9 no      no    no    yes    yes   land       
 8              1       1    10 no      no    no    yes    no    land       
 9              1       1    11 yes     no    yes   no     no    water      
10              1       1    14 yes     no    yes   no     no    water      
11              1       1    15 yes     no    no    no     no    water

In this alternate format you could go even further and include time points that weren't explicitly recorded (e.g. times 7, 8, 12 & 13, in this example) which, once filled in, would make summarizing things like total time spent on each behavior much more straightforward.

Since the behavioral data is binary, those columns could certainly contain logical data rather than character data, but because they started as character data, keeping them that way seemed simplest.

Hope this helps.

Upvotes: 3

Related Questions