Reputation: 33

Count number of times in a month and year that time series data is above a threshold

I have a large dataframe in R with daily time series data of rainfall for a number of locations (each in their own column). I would like to know the number of times the rainfall is less than, or is greater than a threshold value for each location in each month and also by year.

My dataframe is large so I have provided example data here:

Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)

df <- data.frame(Date_ex, A, B, C, D)

How would I find out the number of times the value in A, B, C and D is greater than 4 for each month and then also for each year.

I think I should then be able to summarise this into two new tables.

One like this (example, ignore numbers):

           A     B     C     D
 2000-01   1     0     5     0
 2000-02   2     16    25    0
 2000-03   1     5     26    0

And one like this (example, ignore numbers):

       A   B     C    D
2000   44  221   67  0
2001   67  231   4   132
2002   99  111   66  4
2003   33  45    45  4

I think I should be using dplyr for this? But I'm not sure how to get the dates to work.

Upvotes: 2

Answers (3)

G. Grothendieck

Reputation: 269905

Here are a few solutions.

1) aggregate This solution uses only base R. The new Date column is the date for the first of the month or first of the year.

aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "month"))), sum)
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "year"))), sum)

1a) Using yearmon class from zoo and toyear from (3) we can write:

library(zoo)

aggregate(df[-1] > 4, list(Date = as.yearmon(df[[1]])), sum)
aggregate(df[-1] > 4, list(Date = toyear(df[[1]])), sum)

2) rowsum This is another base R solution. The year/month or year is given by the row names.

rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y-%m"))
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y"))

2a) Using yearmon class from zoo and toyear from (3) we can write:

library(zoo)

rowsum((df[-1] > 4) + 0, as.yearmon(df[[1]]))
rowsum((df[-1] > 4) + 0, toyear(df[[1]]))

3) aggregate.zoo Convert to a zoo object and use aggregate.zoo. Note that yearmon class internally represents a year and month as the year plus 0 for Jan, 1/12 for Feb, 2/12 for March, etc. so taking the integer part gives the year.

library(zoo)
z <- read.zoo(df)

aggregate(z > 4, as.yearmon, sum)

toyear <- function(x) as.integer(as.yearmon(x))
aggregate(z > 4, toyear, sum)

The result is a zoo time series with a yearmon index in the first case and an integer index in the second. If you want a data frame use fortify.zoo(ag) where ag is the result of aggregate.

4) dplyr toyear is from (3).

library(dplyr)
library(zoo)

df %>% 
   group_by(YearMonth = as.yearmon(Date_ex)) %>% 
   summarize_all(funs(sum)) %>%
   ungroup

df %>% 
   group_by(Year = toyear(Date_ex)) %>% 
   summarize_all(funs(sum)) %>%
   ungroup

Upvotes: 2

phiver

Reputation: 23608

Data.table is missing so I'm adding this. Comments are in the code. I used set.seed(1) to generate the samples.

library(data.table)

setDT(df)

# add year and month to df
df[, `:=`(month = month(Date_ex), 
          year = year(Date_ex))]

# monthly returns, remove date_ex
monthly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year, month), .SDcols = -("Date_ex")]
year month  A  B  C D
1: 2000     1 10 10 11 0
2: 2000     2 10 11  8 0
3: 2000     3 11 11 11 0
4: 2000     4 10 11  8 0
5: 2000     5  7 10  8 0
6: 2000     6  9  6  7 0
.....

# yearly returns, remove Date_ex and month
yearly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year), .SDcols = -c("Date_ex", "month")]
year   A   B   C D
1: 2000 114 118 113 0
2: 2001 127 129 120 0
3: 2002 122 108 126 0
4: 2003 123 128 125 0
5: 2004 123 132 131 0
6: 2005  14  15  15 0

Upvotes: 1

www

Reputation: 39154

A solution using the dplyr and lubridate package. The key is to create Year and Month columns, group by those columns, and use summarise_all to summarize the data.

# Create the example data frame, set the seed for reproducibility 
set.seed(199)

Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)

df <- data.frame(Date_ex, A, B, C, D)

library(dplyr)
library(lubridate)

# Summarise for each month
df2 <- df %>%
  mutate(Year = year(Date_ex), Month = month(Date_ex)) %>%
  select(-Date_ex) %>%
  group_by(Year, Month) %>%
  summarise_all(funs(sum(. > 4))) %>%
  ungroup()
df2
# # A tibble: 61 x 6
#     Year Month     A     B     C     D
#    <dbl> <dbl> <int> <int> <int> <int>
#  1  2000     1    13     8    13     0
#  2  2000     2    12     7     8     0
#  3  2000     3     7     9     9     0
#  4  2000     4     9    12    10     0
#  5  2000     5    11    12     8     0
#  6  2000     6    12     9    16     0
#  7  2000     7    10    11    10     0
#  8  2000     8     8    12    14     0
#  9  2000     9    12    12    12     0
# 10  2000    10     9     9     7     0
# # ... with 51 more rows

# Summarise for each year and month
df3 <- df %>%
  mutate(Year = year(Date_ex)) %>%
  select(-Date_ex) %>%
  group_by(Year) %>%
  summarise_all(funs(sum(. > 4)))
df3
# # A tibble: 6 x 5
#    Year     A     B     C     D
#   <dbl> <int> <int> <int> <int>
# 1  2000   120   119   125     0
# 2  2001   119   123   113     0
# 3  2002   135   122   105     0
# 4  2003   114   112   104     0
# 5  2004   115   125   124     0
# 6  2005     9    14    11     0

Upvotes: 2

Count number of times in a month and year that time series data is above a threshold

Answers (3)

Related Questions