Reputation: 1932
I'm working with some daily rainfall data that spans several years. I want to sum the rainfall on consecutive rainy day to get a rainfall total for that rainfall event. It would also be nice to get a start and stop date and rainfall intensity per event. I'm thinking I could hack something together with aggregate
however what I'm thinking of doing in my head seems very bulky. Is there a quick and elegant solution possibly to be found with dplyr
,tdyr
or data.table
.
Data
structure(list(Time = structure(c(1353398400, 1353484800, 1353571200,
1353657600, 1353744000, 1353830400, 1353916800, 1354003200, 1354089600,
1354176000, 1354262400, 1354348800, 1354435200, 1354521600, 1354608000,
1354694400, 1354780800, 1354867200, 1354953600, 1355040000, 1355126400,
1355212800, 1355299200, 1355385600, 1355472000, 1355558400, 1355644800,
1355731200, 1355817600, 1355904000, 1355990400, 1356076800, 1356163200,
1356249600, 1356336000, 1356422400, 1356508800, 1356595200, 1356681600,
1356768000, 1356854400, 1356940800, 1357027200, 1357113600, 1357200000,
1357286400, 1357372800, 1357459200, 1357545600, 1357632000, 1357718400
), class = c("POSIXct", "POSIXt"), tzone = ""), inc = c(NA, NA,
NA, NA, NA, NA, NA, 0.11, NA, 0.62, 0.0899999999999999, 0.39,
NA, NA, 0.03, NA, NA, NA, NA, NA, NA, 0.34, NA, NA, NA, NA, 0.0600000000000001,
0.02, NA, NA, NA, 0.29, 0.35, 0.02, 0.27, 0.17, 0.0600000000000001,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.47, NA, NA, NA, 0.0300000000000002
)), .Names = c("Time", "inc"), row.names = 50:100, class = "data.frame")
Desired output
Begin End Days Total Intensity
11/27/2012 11/27/2012 1 0.11 0.11
11/29/2012 12/1/2012 3 1.1 0.366666667
12/4/2012 12/4/2012 1 0.03 0.03
12/11/2012 12/11/2012 1 0.34 0.34
12/16/2012 12/17/2012 2 0.08 0.04
12/21/2012 12/26/2012 6 0.29 0.048333333
1/5/2013 1/5/2013 1 0.47 0.47
1/9/2013 1/9/2013 1 0.03 0.03
Upvotes: 0
Views: 297
Reputation: 740
Only base packages, and basically using aggregate function. I know it is not the nicest option around. The only problem is with the format of dates (the columns of data frame must be specified one-by-one for the desired date format, otherwise it will be converted to integer):
data1 <- structure(list(Time = structure(c(1353398400, 1353484800, 1353571200,
1353657600, 1353744000, 1353830400, 1353916800, 1354003200, 1354089600,
1354176000, 1354262400, 1354348800, 1354435200, 1354521600, 1354608000,
1354694400, 1354780800, 1354867200, 1354953600, 1355040000, 1355126400,
1355212800, 1355299200, 1355385600, 1355472000, 1355558400, 1355644800,
1355731200, 1355817600, 1355904000, 1355990400, 1356076800, 1356163200,
1356249600, 1356336000, 1356422400, 1356508800, 1356595200, 1356681600,
1356768000, 1356854400, 1356940800, 1357027200, 1357113600, 1357200000,
1357286400, 1357372800, 1357459200, 1357545600, 1357632000, 1357718400
), class = c("POSIXct", "POSIXt"), tzone = ""), inc = c(NA, NA,
NA, NA, NA, NA, NA, 0.11, NA, 0.62, 0.0899999999999999, 0.39,
NA, NA, 0.03, NA, NA, NA, NA, NA, NA, 0.34, NA, NA, NA, NA, 0.0600000000000001,
0.02, NA, NA, NA, 0.29, 0.35, 0.02, 0.27, 0.17, 0.0600000000000001,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.47, NA, NA, NA, 0.0300000000000002
)), .Names = c("Time", "inc"), row.names = 50:100, class = "data.frame")
rainruns <- function(datas = data1) {
incs <- c(NA, datas$inc) # last column
event <- cumsum(is.na(incs[-length(incs)]) & !is.na(incs[-1])) # counter for rain events
datas <- cbind(datas, event) # add events column
datas2 <- datas[!is.na(datas$inc),] # delete na's
summarydata1 <- aggregate(datas2$inc, by = list(datas2$event), # summarize rain data by event
FUN = function(x) c(length(x), sum(x), mean(x)))[[2]]
summarydata2 <- aggregate(as.Date(datas2$Time), by = list(datas2$event), # summarize dates by event
FUN = function(x) c(min(x), max(x)))[[2]]
summarydata <- data.frame(format(as.Date(summarydata2[,1], # combine both, correcting date formats
origin = "1970-01-01"), "%m/%d/%Y"),
format(as.Date(summarydata2[,2],
origin = "1970-01-01"), "%m/%d/%Y"), summarydata1)
names(summarydata) <- c("Begin", "End", "Days", "Total", "Intensity") # update column names
return(summarydata)
}
Upvotes: 1
Reputation: 455
You can append a new column that group rows when they represent a continuous rainy period, then get the statistics you want using dplyr
. assuming that your dataframe is called df
:
library(dplyr)
rain_period = rep(NA,nrow(df)) #initialize vector
group=1 #initialize group number
for(i in 1:nrow(df)){
if(is.na(df$inc[i])) group = group + 1
else rain_period[i] = group
}
df$group = rain_period
result = dplyr::group_by(df,group)
result = dplyr::summarise(result,
Begin = min(Time),
End = max(Time),
Days = n(),
Total = sum(inc),
Intensity = mean(inc))
Upvotes: 1
Reputation: 9133
Here is an approach that uses dplyr
.
First, some preliminary cleanup: a date variable is needed, not a POSIXct:
library(dplyr)
df2 <- df %>%
mutate(date = as.Date(Time)) %>%
select(-Time)
This computes a data frame with an explicit variable for rain_event
:
df3 <- df2 %>%
filter(!is.na(inc)) %>%
mutate(
day_lag = as.numeric(difftime(date, lag(date), units = "days")),
# special case: first rain event
day_lag = ifelse(is.na(day_lag), 1, day_lag),
rain_event = 1 + cumsum(day_lag > 1)
)
> df3
inc date day_lag rain_event
1 0.11 2012-11-27 1 1
2 0.62 2012-11-29 2 2
3 0.09 2012-11-30 1 2
4 0.39 2012-12-01 1 2
5 0.03 2012-12-04 3 3
6 0.34 2012-12-11 7 4
7 0.06 2012-12-16 5 5
8 0.02 2012-12-17 1 5
9 0.29 2012-12-21 4 6
10 0.35 2012-12-22 1 6
11 0.02 2012-12-23 1 6
12 0.27 2012-12-24 1 6
13 0.17 2012-12-25 1 6
14 0.06 2012-12-26 1 6
15 0.47 2013-01-05 10 7
16 0.03 2013-01-09 4 8
Now, summarise by each rain event, computing the metrics you care about:
df3 %>%
group_by(rain_event) %>%
summarise(
begin = min(date),
end = max(date),
days = n(),
total = sum(inc),
intensity = mean(inc)
)
# A tibble: 8 × 6
rain_event begin end days total intensity
<dbl> <date> <date> <int> <dbl> <dbl>
1 1 2012-11-27 2012-11-27 1 0.11 0.1100000
2 2 2012-11-29 2012-12-01 3 1.10 0.3666667
3 3 2012-12-04 2012-12-04 1 0.03 0.0300000
4 4 2012-12-11 2012-12-11 1 0.34 0.3400000
5 5 2012-12-16 2012-12-17 2 0.08 0.0400000
6 6 2012-12-21 2012-12-26 6 1.16 0.1933333
7 7 2013-01-05 2013-01-05 1 0.47 0.4700000
8 8 2013-01-09 2013-01-09 1 0.03 0.0300000
Upvotes: 1
Reputation: 215117
data.table::rleid
is a convenient function for dealing with consecutive values, assuming your data frame is named df
and it has been sorted by Time variable before hand:
library(data.table)
setDT(df)
na.omit(df[,.(Begin = as.Date(first(Time)),
End = as.Date(last(Time)),
Days = as.Date(last(Time)) - as.Date(first(Time)) + 1,
Total = sum(inc), Intensity = mean(inc)),
by = .(id = rleid(is.na(inc)))])
# id Begin End Days Total Intensity
#1: 2 2012-11-27 2012-11-27 1 days 0.11 0.1100000
#2: 4 2012-11-29 2012-12-01 3 days 1.10 0.3666667
#3: 6 2012-12-04 2012-12-04 1 days 0.03 0.0300000
#4: 8 2012-12-11 2012-12-11 1 days 0.34 0.3400000
#5: 10 2012-12-16 2012-12-17 2 days 0.08 0.0400000
#6: 12 2012-12-21 2012-12-26 6 days 1.16 0.1933333 #I think you have some miscalculation here
#7: 14 2013-01-05 2013-01-05 1 days 0.47 0.4700000
#8: 16 2013-01-09 2013-01-09 1 days 0.03 0.0300000
Upvotes: 3