Reputation: 3
I want to find how much daily temperatures differ to a specified monthly mean at various locations. I was thinking of having this a percentage difference value. For example, the specified monthly mean is 20 and I have some days that are 15(-25%), 25(+25%) and 10 (-50%).
The only way I can think of is to create a repeating column with the monthly means at each location and then using the diff function or a percentage difference formula calculate the difference between the columns. I was wondering if there is a more elegant and easier way to do this that would be suitable for big data?
I then want to use this daily trend or difference and apply this to a different set of monthly means to break this into daily data. For example, say the monthly mean is 10 and I have a few day trends that are +25% (12.5), -50% (5) and -25% (7.5). Again, is there an elegant or easier way to this?
Any help would be appreciated. I am still quite new to R!
Here is some sample data:
date <- c("2009-01-01", "2009-01-02", "2009-01-03", "2009-01-04","2009-01-05",
"2009-01-01", "2009-01-02", "2009-01-03", "2009-01-04","2009-01-05",
"2009-01-01", "2009-01-02", "2009-01-03", "2009-01-04","2009-01-05")
location <- c("A", "A", "A", "A", "A",
"B", "B", "B", "B", "B",
"C", "C", "C", "C", "C")
daily_temp <- c(10, 12, 12, 9, 8,
13, 14, 18, 8, 11,
14, 18, 20, 16, 17)
data_daily <- cbind(date, location, daily_temp)
mean_monthly <- c(12, 14, 16)
location_monthly <- c("A", "B", "C")
data_monthly <- cbind(mean_monthly, location_monthly)
Upvotes: 0
Views: 317
Reputation: 27732
Make your source data in the correct format for analysis
df.daily <- as.data.frame( data_daily, stringsAsFactors = F)
df.monthly <- as.data.frame( data_monthly, stringsAsFactors = F)
library( tidyverse )
df.daily <- as.data.frame( data_daily, stringsAsFactors = FALSE )
df.monthly <- as.data.frame( data_monthly, stringsAsFactors = FALSE )
df.daily %>%
left_join( df.monthly, by = c( "location" = "location_monthly" ) ) %>%
mutate( daily_temp = as.numeric( daily_temp ) ) %>%
mutate( mean_monthly = as.numeric( mean_monthly ) ) %>%
mutate( delta_temp = ( daily_temp - mean_monthly ) / mean_monthly )
# date location daily_temp mean_monthly delta_temp
# 1 2009-01-01 A 10 12 -0.16666667
# 2 2009-01-02 A 12 12 0.00000000
# 3 2009-01-03 A 12 12 0.00000000
# 4 2009-01-04 A 9 12 -0.25000000
# 5 2009-01-05 A 8 12 -0.33333333
# 6 2009-01-01 B 13 14 -0.07142857
# 7 2009-01-02 B 14 14 0.00000000
# 8 2009-01-03 B 18 14 0.28571429
# 9 2009-01-04 B 8 14 -0.42857143
# 10 2009-01-05 B 11 14 -0.21428571
# 11 2009-01-01 C 14 16 -0.12500000
# 12 2009-01-02 C 18 16 0.12500000
# 13 2009-01-03 C 20 16 0.25000000
# 14 2009-01-04 C 16 16 0.00000000
# 15 2009-01-05 C 17 16 0.06250000
#less readable but usually faster , especially on larger datasets
library( data.table )
setDT( df.monthly )[, mean_monthly := as.numeric( mean_monthly )][setDT( df.daily )[, daily_temp := as.numeric( daily_temp )], on = c( "location_monthly==location" )][, delta_temp := ( daily_temp - mean_monthly ) / mean_monthly ][]
data.table has a slight advantage
microbenchmark::microbenchmark( tidyverse = {df.daily %>%
left_join( df.monthly, by = c( "location" = "location_monthly" ) ) %>%
mutate( daily_temp = as.numeric( daily_temp ) ) %>%
mutate( mean_monthly = as.numeric( mean_monthly ) ) %>%
mutate( delta_temp = ( daily_temp - mean_monthly ) / mean_monthly )},
data.table = {setDT(df.monthly)[, mean_monthly := as.numeric( mean_monthly )][setDT(df.daily)[, daily_temp := as.numeric( daily_temp )], on = c( "location_monthly==location" )][, delta_temp := ( daily_temp - mean_monthly ) / mean_monthly ][]},
times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval
# tidyverse 2.318527 2.408303 2.579056 2.454999 2.513293 13.104373 100
# data.table 1.515959 1.590221 1.669511 1.643545 1.702141 2.345037 100
Upvotes: 1
Reputation: 66445
Building on @Wimpel's response, here are some ways to summarize location variance.
df.combo <-
df.daily%>%
left_join( df.monthly, by = c( "location" = "location_monthly" ) ) %>%
mutate( daily_temp = as.numeric( daily_temp ) ) %>%
mutate( mean_monthly = as.numeric( mean_monthly ) ) %>%
mutate( delta_temp = ( daily_temp - mean_monthly ) / mean_monthly ) %>%
# Here I add the difference in degrees between daily temp and monthly avg temp
mutate( temp_dif = daily_temp - mean_monthly)
# For each location, what are some stats about those temp_dif values?
df.loc.stats <-
df.combo %>%
group_by(location) %>%
summarize(mean_dif = mean(temp_dif),
mean_abs_dif = mean(abs(temp_dif)),
SD_dif = sd(temp_dif))
The df.loc.stats
table shows that location B had the most varied temperatures (measured either using mean absolute difference or standard deviation, for example), while A was lowest under avg temp and C was highest:
df.loc.stats
# A tibble: 3 x 4
location mean_dif mean_abs_dif SD_dif
<chr> <dbl> <dbl> <dbl>
1 A -1.8 1.8 1.79
2 B -1.2 2.8 3.70
3 C 1 1.8 2.24
Upvotes: 1