Reputation: 33
I have a large dataframe in R with daily time series data of rainfall for a number of locations (each in their own column). I would like to know the number of times the rainfall is less than, or is greater than a threshold value for each location in each month and also by year.
My dataframe is large so I have provided example data here:
Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)
df <- data.frame(Date_ex, A, B, C, D)
How would I find out the number of times the value in A, B, C and D is greater than 4 for each month and then also for each year.
I think I should then be able to summarise this into two new tables.
One like this (example, ignore numbers):
A B C D
2000-01 1 0 5 0
2000-02 2 16 25 0
2000-03 1 5 26 0
And one like this (example, ignore numbers):
A B C D
2000 44 221 67 0
2001 67 231 4 132
2002 99 111 66 4
2003 33 45 45 4
I think I should be using dplyr for this? But I'm not sure how to get the dates to work.
Upvotes: 2
Views: 804
Reputation: 269905
Here are a few solutions.
1) aggregate This solution uses only base R. The new Date
column is the date for the first of the month or first of the year.
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "month"))), sum)
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "year"))), sum)
1a) Using yearmon
class from zoo and toyear
from (3) we can write:
library(zoo)
aggregate(df[-1] > 4, list(Date = as.yearmon(df[[1]])), sum)
aggregate(df[-1] > 4, list(Date = toyear(df[[1]])), sum)
2) rowsum This is another base R solution. The year/month or year is given by the row names.
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y-%m"))
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y"))
2a) Using yearmon
class from zoo and toyear
from (3) we can write:
library(zoo)
rowsum((df[-1] > 4) + 0, as.yearmon(df[[1]]))
rowsum((df[-1] > 4) + 0, toyear(df[[1]]))
3) aggregate.zoo Convert to a zoo object and use aggregate.zoo
. Note that yearmon
class internally represents a year and month as the year plus 0 for Jan, 1/12 for Feb, 2/12 for March, etc. so taking the integer part gives the year.
library(zoo)
z <- read.zoo(df)
aggregate(z > 4, as.yearmon, sum)
toyear <- function(x) as.integer(as.yearmon(x))
aggregate(z > 4, toyear, sum)
The result is a zoo time series with a yearmon
index in the first case and an integer index in the second. If you want a data frame use fortify.zoo(ag)
where ag
is the result of aggregate
.
4) dplyr toyear
is from (3).
library(dplyr)
library(zoo)
df %>%
group_by(YearMonth = as.yearmon(Date_ex)) %>%
summarize_all(funs(sum)) %>%
ungroup
df %>%
group_by(Year = toyear(Date_ex)) %>%
summarize_all(funs(sum)) %>%
ungroup
Upvotes: 2
Reputation: 23608
Data.table is missing so I'm adding this. Comments are in the code. I used set.seed(1) to generate the samples.
library(data.table)
setDT(df)
# add year and month to df
df[, `:=`(month = month(Date_ex),
year = year(Date_ex))]
# monthly returns, remove date_ex
monthly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year, month), .SDcols = -("Date_ex")]
year month A B C D
1: 2000 1 10 10 11 0
2: 2000 2 10 11 8 0
3: 2000 3 11 11 11 0
4: 2000 4 10 11 8 0
5: 2000 5 7 10 8 0
6: 2000 6 9 6 7 0
.....
# yearly returns, remove Date_ex and month
yearly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year), .SDcols = -c("Date_ex", "month")]
year A B C D
1: 2000 114 118 113 0
2: 2001 127 129 120 0
3: 2002 122 108 126 0
4: 2003 123 128 125 0
5: 2004 123 132 131 0
6: 2005 14 15 15 0
Upvotes: 1
Reputation: 39154
A solution using the dplyr
and lubridate
package. The key is to create Year
and Month
columns, group by those columns, and use summarise_all
to summarize the data.
# Create the example data frame, set the seed for reproducibility
set.seed(199)
Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)
df <- data.frame(Date_ex, A, B, C, D)
library(dplyr)
library(lubridate)
# Summarise for each month
df2 <- df %>%
mutate(Year = year(Date_ex), Month = month(Date_ex)) %>%
select(-Date_ex) %>%
group_by(Year, Month) %>%
summarise_all(funs(sum(. > 4))) %>%
ungroup()
df2
# # A tibble: 61 x 6
# Year Month A B C D
# <dbl> <dbl> <int> <int> <int> <int>
# 1 2000 1 13 8 13 0
# 2 2000 2 12 7 8 0
# 3 2000 3 7 9 9 0
# 4 2000 4 9 12 10 0
# 5 2000 5 11 12 8 0
# 6 2000 6 12 9 16 0
# 7 2000 7 10 11 10 0
# 8 2000 8 8 12 14 0
# 9 2000 9 12 12 12 0
# 10 2000 10 9 9 7 0
# # ... with 51 more rows
# Summarise for each year and month
df3 <- df %>%
mutate(Year = year(Date_ex)) %>%
select(-Date_ex) %>%
group_by(Year) %>%
summarise_all(funs(sum(. > 4)))
df3
# # A tibble: 6 x 5
# Year A B C D
# <dbl> <int> <int> <int> <int>
# 1 2000 120 119 125 0
# 2 2001 119 123 113 0
# 3 2002 135 122 105 0
# 4 2003 114 112 104 0
# 5 2004 115 125 124 0
# 6 2005 9 14 11 0
Upvotes: 2