Reputation: 29
We have two data frames. station_data
contains weather observations at the geography-day level. tavg_monthly
contains the quantiles of tvag
at the geography-month level. We want to create a dummy variable equal to TRUE
if the observation in station_data
is greater or equal to the 75% percentile or smaller than the 25% percentile (stored in tavg_monthly
as tavg_monthly$75%
or tavg_monthly$75%
) indicating "extreme weather". Observations are grouped by fips
and month
.
Example station_data:
structure(list(fips = c("01073", "01073", "01073", "01073", "01073",
"01073"), rain = c(0, 0, 0, 0, 0, 0), year = c("1980", "1980",
"1980", "1980", "1980", "1980"), week = c(1L, 1L, 1L, 1L, 1L,
1L), month = c("01", "01", "01", "01", "01", "01"), day = c("001",
"002", "003", "004", "005", "006"), tavg = c(3.32500010728836,
4.64999985694885, 7.77500009536743, 4.3125, 0, 1.86249995231628
)), row.names = c(NA, 6L), class = "data.frame")
Example prcp_monthly:
structure(list(fips = c("01073", "01073", "01073", "01073", "01073",
"01073"), month = c("01", "02", "03", "04", "05", "06"),
`25%` = c(2.68333338201046,
4.65000009536743, 8.86249977350235, 13.8229166865349, 18.7999997138977,
23.7364585399628), `75%` = c(9.79999996721745, 12.1333334445953,
16.3260417580605, 20.1833333969116, 23.6843748092651, 26.5312495231628
), n = c(1116L, 1017L, 1116L, 1080L, 1116L, 1080L)), row.names = c(NA,
6L), class = "data.frame")
Using the following line
setDT(station_data)[, extr_tavg_monthly := station_data$tavg>=prcp_monthly$`75%` | output$tavg<=input$`25%` , by = list(fips, month)]
I get an extra column with results, however, they are inconsistent (i.e. sometimes wrong). I receive over 50 warnings of the form
In `[.data.table`(setDT(station_data), , `:=`(extr_prcp_monthly, ...:
RHS 1 is length (greater than the size (1116) of group 25). The
last 35868 element(s) will be discarded.
where 35868 / 12 months = 3082 (number of my geographical units) and 1116 obs. = 36 years of data * 31 days (e.g. in January) in the full data set.
The result is:
fips rain year week month day tavg extr_tavg_monthly
1: 01073 0 1980 1 01 001 3.3250 FALSE
2: 01073 0 1980 1 01 002 4.6500 TRUE
3: 01073 0 1980 1 01 003 7.7750 TRUE
4: 01073 0 1980 1 01 004 4.3125 TRUE
5: 01073 0 1980 1 01 005 0.0000 TRUE
6: 01073 0 1980 1 01 006 1.8625 TRUE
It should be however,
fips rain year week month day tavg extr_tavg_monthly
1: 01073 0 1980 1 01 001 3.3250 FALSE
2: 01073 0 1980 1 01 002 4.6500 FALSE
3: 01073 0 1980 1 01 003 7.7750 FALSE
4: 01073 0 1980 1 01 004 4.3125 FALSE
5: 01073 0 1980 1 01 005 0.0000 TRUE
6: 01073 0 1980 1 01 006 1.8625 TRUE
given that the quartiles for month=01
and fips=01073
are
fips month 25% 75% n
1 01073 01 2.683333 9.8 1116
Upvotes: 0
Views: 80
Reputation: 42564
Alternatively, this can be solved using a "non-equi update join":
library(data.table)
setDT(station_data)[setDT(prcp_monthly),
on = .(fips, month, tavg >= `25%`, tavg < `75%`),
extr_tavg_monthly := FALSE][
is.na(extr_tavg_monthly), extr_tavg_monthly := TRUE][]
fips rain year week month day tavg extr_tavg_monthly 1: 01073 0 1980 1 01 001 3.3250 FALSE 2: 01073 0 1980 1 01 002 4.6500 FALSE 3: 01073 0 1980 1 01 003 7.7750 FALSE 4: 01073 0 1980 1 01 004 4.3125 FALSE 5: 01073 0 1980 1 01 005 0.0000 TRUE 6: 01073 0 1980 1 01 006 1.8625 TRUE
Please, note that besides extr_tavg_monthly
no other columns have been added to the station dataset. This is in contrast to this answer which adds also the 25%
and 75%
columns to station_data
.
If I understand correctly from OP's comment, it is required that extr_tavg_monthly
should be NA
in case tavg
is missing. This can be achieved by a slight modification.
# create 2nd dataset by appending an additional row containing NA
station_data2 <- rbind(setDT(station_data), station_data[.N])
station_data2[.N, `:=`(day = "007", tavg = NA)]
station_data2
fips rain year week month day tavg 1: 01073 0 1980 1 01 001 3.3250 2: 01073 0 1980 1 01 002 4.6500 3: 01073 0 1980 1 01 003 7.7750 4: 01073 0 1980 1 01 004 4.3125 5: 01073 0 1980 1 01 005 0.0000 6: 01073 0 1980 1 01 006 1.8625 7: 01073 0 1980 1 01 007 NA
station_data2[setDT(prcp_monthly),
on = .(fips, month, tavg >= `25%`, tavg < `75%`),
extr_tavg_monthly := FALSE][
is.na(extr_tavg_monthly) & !is.na(tavg), extr_tavg_monthly := TRUE]
station_data2
fips rain year week month day tavg extr_tavg_monthly 1: 01073 0 1980 1 01 001 3.3250 FALSE 2: 01073 0 1980 1 01 002 4.6500 FALSE 3: 01073 0 1980 1 01 003 7.7750 FALSE 4: 01073 0 1980 1 01 004 4.3125 FALSE 5: 01073 0 1980 1 01 005 0.0000 TRUE 6: 01073 0 1980 1 01 006 1.8625 TRUE 7: 01073 0 1980 1 01 007 NA NA
Upvotes: 1
Reputation: 29
What works is merging in the quartiles, so I guess the reason comes from the missmatch in length as given in the warning messages.
setDT(station_data)[setDT(tavg_monthly), `25%` := `25%`, on=c("fips", "month")]
setDT(station_data)[setDT(tavg_monthly), `75%` := `75%`, on=c("fips", "month")]
setDT(station_data)[, extr_tavg_monthly :=tavg>=`75%` | tavg<=`25%`, by = list(fips, month)]
Upvotes: 0