Reputation: 3805
I have 4 data frames, each data frame corresponding to a single year. Each data frame contains daily rainfall for five locations.
Generate sample data
location <- c("A","B","C","D","E")
mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits=2)
dat.1981 <-as.data.frame(cbind(location,mat)) # rainfall for 1981
dat.1981$year <- 1981
mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
dat.1982 <- as.data.frame(cbind(location,mat)) # rainfall for 1982
dat.1982$year <- 1982
mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
dat.1983 <-as.data.frame(cbind(location,mat)) # rainfall for 1983
dat.1983$year <- 1983
mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
dat.1984 <-as.data.frame(cbind(location,mat)) # rainfall for 1984
dat.1984$year <- 1984
dat <- as.data.frame(rbind(dat.1981,dat.1982,dat.1983,dat.1984))
For each year, I want to classify whether a day was an extreme wet day or not
Here's how I do my calculation:
1) For each location, generate the mean and sd of rainfall for every week for the period 1981 to 1984. For example, in location A the mean rainfall for the first week will be:
(First week rain 1981 in A + First week rain 1982 in A + First week rain 1983 in A + First week rain 1984 in A)/4
which can be written in R as
mean.week1.loc1 <- mean(rowSums(dat[dat$location=="A",2:8])) # 2:8 selects the first 7 days in each year
sd.week1.loc1 <- sd(rowSums(dat[dat$location=="A",2:8]))
wet.cr <- mean.week1 + sd.week1 # this is my threshold for defining a wet day
If daily rainfall in week 1 for the years 1981 to 1984 in location A is greater than wet.cr
,
that day is a wet day and hence gets a value of 1
As an example, to examine whether rainfall of week 1 for location A for 1981 to 1984 is a wet day or not I can do the following:
lapply(dat[, 2:8], function(x) ifelse(x > wet.cr, 1, 0))
I want to repeat this for each week and each location.
However, I am unable to stitch these individual functions together and also
my final results should be a dataframe same as dat
but instead of rainfall values, I will have 1 or 0 defining whether it is a wet day or not.
EDIT
The solutions below gives me the following:
mean(c(rainfall 1981 day 1 week 1, ...., rainfall 1981 day 7 week 1, rainfall 1982 day 1 week 1,....,rainfall 1982 day 7 week 1,....,rainfall 1984 day 1 week 1,....,rainfall 1984 day 7 week 1))
WHAT I WANT IS:
mean(c(mean(total rainfall week 1 1981), mean(total rainfall week 1 1982), mean(total rainfall week 1 1983), mean(total rainfall week 1 1984)))
I hope this is clear now.
Upvotes: 0
Views: 2034
Reputation: 28331
A tidyverse
solution
library(magrittr)
library(tidyverse)
dat_m <- gather(dat, day, rainfall, -location, -year)
str(dat_m)
dat_m %<>%
mutate(day = gsub("V", "", day)) %>%
mutate(day = as.numeric(day)) %>%
mutate(week = as.integer(ceiling(day/7))) %>%
group_by(location, week) %>%
mutate(wet.cr = mean(rainfall, na.rm = TRUE) + sd(rainfall, na.rm = TRUE) ) %>%
mutate(indication = ifelse(rainfall > wet.cr, 1, 0)) %>%
ungroup()
dat_m
# A tibble: 7,300 x 7
location year day rainfall week wet.cr indication
<fctr> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 A 1981 1.00 0.880 1 0.845 1.00
2 B 1981 1.00 0.850 1 0.829 1.00
3 C 1981 1.00 1.00 1 0.877 1.00
4 D 1981 1.00 0.100 1 0.755 0
5 E 1981 1.00 0.190 1 0.750 0
6 A 1982 1.00 0.380 1 0.845 0
7 B 1982 1.00 0.760 1 0.829 0
8 C 1982 1.00 0.940 1 0.877 1.00
9 D 1982 1.00 0.900 1 0.755 1.00
10 E 1982 1.00 0.600 1 0.750 0
# ... with 7,290 more rows
Edit: For rainfall, I think it's better to use sum
(total) than mean
So we first calculate the total weekly rainfall for every year. Then we calculate the long-term average & stdev of the total weekly rainfall.
dat_m %<>%
mutate(day = as.numeric(gsub("V", "", day)),
week = as.integer(ceiling(day/7))) %>%
group_by(location, week, year) %>%
mutate(total_weekly_rainfall = sum(rainfall, na.rm = TRUE)) %>%
ungroup() %>%
group_by(location, week) %>%
mutate(mean_weekly_rainfall = sum(rainfall, na.rm = TRUE)/length(unique(year)),
stddev_weekly_rainfall = sd(rainfall, na.rm = TRUE),
wet.cr = mean_weekly_rainfall + stddev_weekly_rainfall,
indication = ifelse(total_weekly_rainfall > wet.cr, 1, 0)) %>%
arrange(location, year, day) %>%
ungroup() %>%
distinct(location, year, week, .keep_all = TRUE)
dat_m
# A tibble: 1,060 x 10
location year day rainfall week total_wee~ mean_wee~ stddev_w~ wet.~ indic~
<fctr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1981 1.00 0.880 1 0.880 0.630 0.277 0.907 0
2 A 1981 8.00 0.190 2 0.190 0.330 0.431 0.761 0
3 A 1981 15.0 0.630 3 0.630 0.548 0.331 0.878 0
4 A 1981 22.0 0.0300 4 0.0300 0.290 0.259 0.549 0
5 A 1981 29.0 0.360 5 0.360 0.308 0.196 0.504 0
6 A 1981 36.0 0.540 6 0.540 0.500 0.225 0.725 0
7 A 1981 43.0 0.0300 7 0.0300 0.375 0.289 0.664 0
8 A 1981 50.0 0.170 8 0.170 0.332 0.375 0.708 0
9 A 1981 57.0 0.260 9 0.260 0.652 0.272 0.924 0
10 A 1981 64.0 0.590 10 0.590 0.512 0.202 0.715 0
# ... with 1,050 more rows
Upvotes: 3
Reputation: 3923
For both the data.table and the tidyverse solutions you may be well served to treat this as a scaling exercise (a z score in many disciplines), since the mean + n standard deviation is a well known benchmark.
For the data.table solution you would:
newdat[,zrain := scale(rain), by = .(location,Week)]
newdat[,zwet := ifelse(zrain > 1.0,1,0)]
where you rely on scale
from base and compare to 1.0
For the tidyverse that becomes:
mutate(zrain = scale(rainfall)) %>%
mutate(indication = ifelse(zrain > 1.0, 1, 0)) %>%
That way in the future if your standard for what "wet" is changes you only have to change one number in your code
Upvotes: 0
Reputation: 5673
using data.table :
library(data.table)
dat <- setDT(dat)
newdat <- melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")
newdat[,day := as.character(day)]
newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
newdat[,Week := day %/% 7]
newdat[,threshold := mean(rain) + sd(rain), by = .(location,Week)]
newdat[,wet := ifelse(rain > threshold,1,0)]
print(newdat,topn = 100)
location year day rain Week threshold wet
1: A 1981 1 0.73 0 0.7630065 0
2: B 1981 1 0.69 0 0.8599243 0
3: C 1981 1 0.45 0 0.8145956 0
4: D 1981 1 0.51 0 0.8935058 0
5: E 1981 1 0.77 0 0.6992752 1
6: A 1982 1 0.47 0 0.7630065 0
7: B 1982 1 0.70 0 0.8599243 0
8: C 1982 1 0.48 0 0.8145956 0
9: D 1982 1 0.92 0 0.8935058 1
a step by step explanation: first you need to change format of your data to ease the calculus. The long format is more appropriate, as each column V## is actually a variable that is the number day. This is done using melt
melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")
location year day rain
1: A 1981 V1 0.73
2: B 1981 V1 0.69
3: C 1981 V1 0.45
4: D 1981 V1 0.51
5: E 1981 V1 0.77
---
7296: A 1984 V365 0.31
7297: B 1984 V365 0.99
7298: C 1984 V365 0.25
7299: D 1984 V365 0.24
7300: E 1984 V365 0.87
Then you tranform your day to a real number, in order to be able to calculate the week
newdat[,day := as.character(day)]
newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
> newdat[,.(day,year)]
day year
1: 1 1981
2: 1 1981
3: 1 1981
4: 1 1981
5: 1 1981
Then calculate the week number the same as you do
newdat[,Week := day %/% 7]
The statistic for the calculus of the sthreshold is done by grouping on the weeks and places (so statistic over the year for each place)
newdat[,threshold := mean(rain) + sd(rain), by = .(location,Week)]
and define your wet day as the day when rain is higher than the threshold
newdat[,wet := ifelse(rain > threshold,1,0)]
but I agree on the comment that the initial data was surely in a better format than what you are presenting.
Upvotes: 1