Reputation: 197

Take mean of a subset of rows selected using a boolean vector produced for each row

I can do the following using sapply, but I wanted if dplyr can do it as well, or if it's a problem outside the scope of dplyr.

I have three columns, one with dates (Date), one with date intervals (Intervals), and the third with numeric values (Values).

What I want to do is for each row:

1) Get the value Date

2) See inside which intervals Date is present (Date %within% Intervals), and produce the corresponding boolean vector

3) Use this boolean vector to subset the entire original dataframe

4) Take the mean of the Values for this dataframe

5) Store the value in a new column called mean_for_Date

For the following input the values for the column mean_for_Date would be 3 and 1:

library(lubridate)
library(dplyr)
df <- data.frame( Date = as.Date(c("1998-01-02", "1998-01-06")), Values =  c(1, 3))
df$Intervals <- c(interval(df[2, "Date"] - days(2), df[2, "Date"] + days(2)), interval(df[1, "Date"] - days(2), df[1, "Date"] + days(2)))

In sapply I did it like this:

df$mean_for_Date <- sapply(df$Date, function (x) mean(df$Values[x %within% df$Intervals], na.rm = T))

Thanks in advance for all possible advice

Upvotes: 2

Answers (3)

Tung

Reputation: 28411

dplyr::mutate and purrr::map_dbl in the tidyverse can do the same

library(lubridate) # date
library(magrittr)  # %>% and %<>%
library(tidyverse) # mutate and map_dbl

df <- data.frame( Date = as.Date(c("1998-01-02", "1998-01-06")), Values =  c(1, 3))
df$Intervals <- c(interval(df[2, "Date"] - days(2), df[2, "Date"] + days(2)), 
                  interval(df[1, "Date"] - days(2), df[1, "Date"] + days(2))) 

df %<>%
  mutate(mean_for_Date = map_dbl(Date, ~ mean(Values[. %within% Intervals], na.rm = TRUE)))
df

#>         Date Values                      Intervals mean_for_Date
#> 1 1998-01-02      1 1998-01-04 UTC--1998-01-08 UTC             3
#> 2 1998-01-06      3 1997-12-31 UTC--1998-01-04 UTC             1

str(df)
#> 'data.frame':    2 obs. of  4 variables:
#>  $ Date         : Date, format: "1998-01-02" "1998-01-06"
#>  $ Values       : num  1 3
#>  $ Intervals    :Formal class 'Interval' [package "lubridate"] with 3 slots
#>   .. ..@ .Data: num  345600 345600
#>   .. ..@ start: POSIXct, format: "1998-01-04" ...
#>   .. ..@ tzone: chr "UTC"
#>  $ mean_for_Date: num  3 1

Created on 2018-03-10 by the reprex package (v0.2.0).

Upvotes: 2

quickreaction

Reputation: 685

Maybe I'm seeing the problem too simplistically? Would this get the mean(as a function of date) given the dates fall in the interval?

 df <- df %>% 
          left_join(df %>% 
                       filter(Date %within% Intervals & !is.na(Values)) %>% 
                       group_by(Date) %>% 
                       summarise(mean_for_Date = mean(Values)) %>% 
                       select(Date,mean_for_Date)), by="Date")

Upvotes: 0

MKR

Reputation: 20095

One solution can be achieved by using a dummy column to apply cartesian join between each row of data frame df to self-join it with itsefl.

I have taken away few not required columns while preparing to join. The cartesian join provides all possible combinations of Date and Intervals. I have kept the Values along with Interval so that when Date is matched then corresponding Values can be taken.

IsFound decides the rows in which Date is within Intervals. We need to filter only those rows with IsFound true because Values from those will derive mean for that date.

The solution will work if a Date is found within multiple Interval. The group_by will finally help to find mean.

  library(dplyr)
  df %>% select(-Values, - Intervals) %>% mutate(dummy = 1) %>%
    inner_join(select(df, -Date) %>% mutate(dummy = 1), by="dummy") %>%
    mutate(IsFound = Date %within% Intervals) %>%
    filter(IsFound) %>%
    select(-dummy, - IsFound) %>%
    group_by(Date) %>%
    summarise(Mean = mean(Values))

# A tibble: 2 x 2
#  Date        Mean
#  <date>     <dbl>
#1 1998-01-02  3.00
#2 1998-01-06  1.00

Upvotes: 2

Take mean of a subset of rows selected using a boolean vector produced for each row

Answers (3)

Related Questions