Jawairia

Reputation: 305

How to calculate aggregate statistics on a dataframe in R by applying conditions on time values?

I am working on climate data analysis. After loading my file into R, I want to subset the data by hour of the day.

For time analysis, we can use $hour on the variable that stores the time vector (if it is a POSIXlt object) when we want to work with hours.

I want to subset my data for each hour of the day across 365 days and then average the values recorded at a particular hour throughout the year. For example, I want to take the irradiation, wind speed, etc. at 12:00 PM on every day of the year and then take the mean of these values.

I know how to subset a data frame based on conditions. For example, if my data is in a matrix called data that contains 2 columns, say time and wind speed, and I want to keep only the rows in which the irradiation isn't zero, I can do this with the following code:

my_data <- subset(data, data[,1]>0)

But how can I subset the rows based on the hour values in the time column, which is a variable stored in data?

My data look like this:

[Image: screenshot of the data, not included]

I hope I made sense in this question.

Thanks in advance!

Upvotes: 2

Views: 133

Answers (1)

Florian

Reputation: 25415

Here is a possible solution. You can create an hourly grouping column with format(time,'%H'), which keeps only the hour of each timestamp. You can then group by this new column and calculate the mean for each group.

# Example data: 49 hourly timestamps (two days) with random values from 1 to 5
df = data.frame(time = seq(Sys.time(), Sys.time() + 2*60*60*24, by = 'hour'),
                val = sample(seq(5), 49, replace = TRUE))

library(dplyr)

df %>% mutate(hour = format(time, '%H')) %>%
  group_by(hour) %>%
  summarize(mean_val = mean(val))
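If only a single hour is of interest (noon, as in the question), the same hour label can be used as a logical mask instead of a grouping column. The following is a minimal base R sketch, not part of the original answer; the example data, the fixed UTC time zone, and the names is_noon and noon_mean are assumptions made for illustration.

```r
# Hypothetical example data: 72 hourly readings (three days), starting at midnight UTC
set.seed(1)
start <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")
df <- data.frame(
  time = seq(start, by = "hour", length.out = 72),
  val  = sample(0:5, 72, replace = TRUE)
)

# TRUE for rows whose timestamp falls in the 12:00 hour
is_noon <- format(df$time, "%H") == "12"

# Mean of the values recorded at noon across all days
noon_mean <- mean(df$val[is_noon])
```

With a full year of hourly data, is_noon would select 365 rows, one per day.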

To subset the non-zero values first, you can do either:

df = subset(df,val!=0)

or start the dplyr chain with:

df %>% filter(val != 0)

Hope this helps!


df looks as follows:

                  time val
1  2018-01-31 12:43:33   4
2  2018-01-31 13:43:33   2
3  2018-01-31 14:43:33   2
4  2018-01-31 15:43:33   3
5  2018-01-31 16:43:33   3
6  2018-01-31 17:43:33   1
7  2018-01-31 18:43:33   2
8  2018-01-31 19:43:33   4
...    ...       ...     ...

And the output:

# A tibble: 24 x 2
   hour  mean_val
   <chr>    <dbl>
 1 00        3.50
 2 01        3.50
 3 02        4.00
 4 03        2.50
 5 04        3.00
 6 05        2.00
 ....        ....

This assumes your time column is already of class POSIXct. Otherwise, you'd first have to convert it, for example with as.POSIXct(x,format='%Y-%m-%d %H:%M:%S').
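As a rough sketch of that conversion step, assuming the timestamps were read in as character strings in the format above: the data frame raw, its values, and the fixed UTC time zone are made up for illustration, and aggregate() is used here as a base R stand-in for the group_by/summarize step.

```r
# Hypothetical raw data: timestamps stored as character strings
raw <- data.frame(
  time = c("2018-01-31 12:00:00", "2018-01-31 13:00:00",
           "2018-02-01 12:00:00", "2018-02-01 13:00:00"),
  val  = c(4, 2, 6, 0),
  stringsAsFactors = FALSE
)

# Convert to POSIXct; the time zone is fixed so the result does not
# depend on the local settings of the machine
raw$time <- as.POSIXct(raw$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Two-digit hour label for each row
raw$hour <- format(raw$time, "%H")

# Mean value per hour of the day
hourly_mean <- aggregate(val ~ hour, data = raw, FUN = mean)
```

If the timestamps use a different layout, the format string has to be adjusted to match; strings that do not match it are converted to NA.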

Upvotes: 1
