Reputation: 3805

R: calculating mean and sd for each week of the year

I have 4 data frames, each data frame corresponding to a single year. Each data frame contains daily rainfall for five locations.

Generate sample data

    location <- c("A","B","C","D","E")
    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits=2)
    dat.1981 <-as.data.frame(cbind(location,mat)) # rainfall for 1981
    dat.1981$year <- 1981

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1982 <- as.data.frame(cbind(location,mat)) # rainfall for 1982
    dat.1982$year <- 1982

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1983 <-as.data.frame(cbind(location,mat)) # rainfall for 1983
    dat.1983$year <- 1983

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1984 <-as.data.frame(cbind(location,mat)) # rainfall for 1984
    dat.1984$year <- 1984

    dat <- as.data.frame(rbind(dat.1981,dat.1982,dat.1983,dat.1984))

For each year, I want to classify whether a day was an extreme wet day or not

Here's how I do my calculation:

1) For each location, generate the mean and sd of rainfall for every week for the period 1981 to 1984. For example, in location A the mean rainfall for the first week will be:

(First week rain 1981 in A + First week rain 1982 in A + First week rain 1983 in A + First week rain 1984 in A)/4

which can be written in R as

    mean.week1.loc1 <- mean(rowSums(dat[dat$location=="A",2:8])) # 2:8 selects the first 7 days in each year
    sd.week1.loc1 <- sd(rowSums(dat[dat$location=="A",2:8])) 

    wet.cr <- mean.week1 + sd.week1 # this is my threshold for defining a wet day

If daily rainfall in week 1 for the years 1981 to 1984 in location A is greater than wet.cr, that day is a wet day and hence gets a value of 1

As an example, to examine whether rainfall of week 1 for location A for 1981 to 1984 is a wet day or not I can do the following:

   lapply(dat[, 2:8], function(x) ifelse(x > wet.cr, 1, 0))

I want to repeat this for each week and each location.

However, I am unable to stitch these individual functions together and also my final results should be a dataframe same as dat but instead of rainfall values, I will have 1 or 0 defining whether it is a wet day or not.

EDIT

The solutions below gives me the following:

mean(c(rainfall 1981 day 1 week 1, ...., rainfall 1981 day 7 week 1, rainfall 1982 day 1 week 1,....,rainfall 1982 day 7 week 1,....,rainfall 1984 day 1 week 1,....,rainfall 1984 day 7 week 1))

WHAT I WANT IS:

mean(c(mean(total rainfall week 1 1981), mean(total rainfall week 1 1982), mean(total rainfall week 1 1983), mean(total rainfall week 1 1984)))

I hope this is clear now.

Upvotes: 0

Answers (3)

Tung

Reputation: 28331

A tidyverse solution

    library(magrittr)
    library(tidyverse)

    dat_m <- gather(dat, day, rainfall, -location, -year)
    str(dat_m)

    dat_m %<>%
      mutate(day = gsub("V", "", day)) %>%
      mutate(day = as.numeric(day)) %>% 
      mutate(week = as.integer(ceiling(day/7))) %>% 
      group_by(location, week) %>% 
      mutate(wet.cr = mean(rainfall, na.rm = TRUE) + sd(rainfall, na.rm = TRUE) ) %>% 
      mutate(indication = ifelse(rainfall > wet.cr, 1, 0)) %>% 
      ungroup()
    dat_m 

    # A tibble: 7,300 x 7
       location  year   day rainfall  week wet.cr indication
       <fctr>   <dbl> <dbl>    <dbl> <int>  <dbl>      <dbl>
     1 A         1981  1.00    0.880     1  0.845       1.00
     2 B         1981  1.00    0.850     1  0.829       1.00
     3 C         1981  1.00    1.00      1  0.877       1.00
     4 D         1981  1.00    0.100     1  0.755       0   
     5 E         1981  1.00    0.190     1  0.750       0   
     6 A         1982  1.00    0.380     1  0.845       0   
     7 B         1982  1.00    0.760     1  0.829       0   
     8 C         1982  1.00    0.940     1  0.877       1.00
     9 D         1982  1.00    0.900     1  0.755       1.00
    10 E         1982  1.00    0.600     1  0.750       0   
    # ... with 7,290 more rows

Edit: For rainfall, I think it's better to use sum (total) than mean

So we first calculate the total weekly rainfall for every year. Then we calculate the long-term average & stdev of the total weekly rainfall.

    dat_m %<>%
      mutate(day = as.numeric(gsub("V", "", day)),
             week = as.integer(ceiling(day/7))) %>%
      group_by(location, week, year) %>% 
      mutate(total_weekly_rainfall = sum(rainfall, na.rm = TRUE)) %>% 
      ungroup() %>% 
      group_by(location, week) %>% 
      mutate(mean_weekly_rainfall = sum(rainfall, na.rm = TRUE)/length(unique(year)),
             stddev_weekly_rainfall = sd(rainfall, na.rm = TRUE),
             wet.cr =  mean_weekly_rainfall + stddev_weekly_rainfall,
             indication = ifelse(total_weekly_rainfall > wet.cr, 1, 0)) %>% 
      arrange(location, year, day) %>% 
      ungroup() %>% 
      distinct(location, year, week, .keep_all = TRUE)
    dat_m 

    # A tibble: 1,060 x 10
       location  year   day rainfall  week total_wee~ mean_wee~ stddev_w~ wet.~ indic~
       <fctr>   <dbl> <dbl>    <dbl> <int>      <dbl>     <dbl>     <dbl> <dbl>  <dbl>
     1 A         1981  1.00   0.880      1     0.880      0.630     0.277 0.907      0
     2 A         1981  8.00   0.190      2     0.190      0.330     0.431 0.761      0
     3 A         1981 15.0    0.630      3     0.630      0.548     0.331 0.878      0
     4 A         1981 22.0    0.0300     4     0.0300     0.290     0.259 0.549      0
     5 A         1981 29.0    0.360      5     0.360      0.308     0.196 0.504      0
     6 A         1981 36.0    0.540      6     0.540      0.500     0.225 0.725      0
     7 A         1981 43.0    0.0300     7     0.0300     0.375     0.289 0.664      0
     8 A         1981 50.0    0.170      8     0.170      0.332     0.375 0.708      0
     9 A         1981 57.0    0.260      9     0.260      0.652     0.272 0.924      0
    10 A         1981 64.0    0.590     10     0.590      0.512     0.202 0.715      0
    # ... with 1,050 more rows

Upvotes: 3

Chuck P

Reputation: 3923

For both the data.table and the tidyverse solutions you may be well served to treat this as a scaling exercise (a z score in many disciplines), since the mean + n standard deviation is a well known benchmark.

For the data.table solution you would:

newdat[,zrain := scale(rain),  by = .(location,Week)]
newdat[,zwet := ifelse(zrain > 1.0,1,0)]

where you rely on scale from base and compare to 1.0

For the tidyverse that becomes:

mutate(zrain = scale(rainfall)) %>% 
mutate(indication = ifelse(zrain > 1.0, 1, 0)) %>%

That way in the future if your standard for what "wet" is changes you only have to change one number in your code

Upvotes: 0

denis

Reputation: 5673

using data.table :

library(data.table)
dat <- setDT(dat)
newdat <- melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")
newdat[,day := as.character(day)]
newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
newdat[,Week := day %/% 7]
newdat[,threshold := mean(rain) + sd(rain),  by = .(location,Week)]
newdat[,wet := ifelse(rain > threshold,1,0)]
print(newdat,topn = 100)


      location year day rain Week threshold wet
   1:        A 1981   1 0.73    0 0.7630065   0
   2:        B 1981   1 0.69    0 0.8599243   0
   3:        C 1981   1 0.45    0 0.8145956   0
   4:        D 1981   1 0.51    0 0.8935058   0
   5:        E 1981   1 0.77    0 0.6992752   1
   6:        A 1982   1 0.47    0 0.7630065   0
   7:        B 1982   1 0.70    0 0.8599243   0
   8:        C 1982   1 0.48    0 0.8145956   0
   9:        D 1982   1 0.92    0 0.8935058   1

a step by step explanation: first you need to change format of your data to ease the calculus. The long format is more appropriate, as each column V## is actually a variable that is the number day. This is done using melt

melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")

     location year  day rain
   1:        A 1981   V1 0.73
   2:        B 1981   V1 0.69
   3:        C 1981   V1 0.45
   4:        D 1981   V1 0.51
   5:        E 1981   V1 0.77
  ---                        
7296:        A 1984 V365 0.31
7297:        B 1984 V365 0.99
7298:        C 1984 V365 0.25
7299:        D 1984 V365 0.24
7300:        E 1984 V365 0.87

Then you tranform your day to a real number, in order to be able to calculate the week

newdat[,day := as.character(day)]
newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
> newdat[,.(day,year)]
      day year
   1:   1 1981
   2:   1 1981
   3:   1 1981
   4:   1 1981
   5:   1 1981

Then calculate the week number the same as you do

newdat[,Week := day %/% 7]

The statistic for the calculus of the sthreshold is done by grouping on the weeks and places (so statistic over the year for each place)

newdat[,threshold := mean(rain) + sd(rain), by = .(location,Week)]

and define your wet day as the day when rain is higher than the threshold

newdat[,wet := ifelse(rain > threshold,1,0)]

but I agree on the comment that the initial data was surely in a better format than what you are presenting.

Upvotes: 1

R: calculating mean and sd for each week of the year

Answers (3)

Related Questions