Reputation: 101
If I have a dataframe A
A =
year month day hour minute rain
.
.
.
2000 01 01 01 00 2
2000 01 01 01 15 2
2000 01 01 01 30 NA
2000 01 01 01 45 3
2000 01 01 02 00 4
2000 01 01 02 15 5
.
.
.
Data frame A has data from 1990 to 2000 at a 15-minute frequency.
So A contains the same date and time in different years, and some specific values are missing (NA):
A =
year month day hour minute rain
1990 01 01 01 30 10
.
.
.
1991 01 01 01 30 21
.
.
.
1992 01 01 01 30 4
.
.
.
1993 01 01 01 30 6
.
.
.
1994 01 01 01 30 10
.
.
.
1995 01 01 01 30 23
.
.
.
1996 01 01 01 30 0
.
.
.
1997 01 01 01 30 0
.
.
.
1998 01 01 01 30 0
.
.
.
1999 01 01 01 30 6
.
.
.
2000 01 01 01 30 NA
The idea is to find every NA in the rain column and replace it with the average of the values at the same date and time in each year from 1990 to 2000.
For the year 2000, the new A would be:
A =
year month day hour minute rain
.
.
.
2000 01 01 01 00 2
2000 01 01 01 15 2
2000 01 01 01 30 **8**
2000 01 01 01 45 3
2000 01 01 02 00 4
2000 01 01 02 15 5
.
.
.
Upvotes: 2
Views: 1050
Reputation: 39174
A solution using dplyr. The idea is to summarize rain in A by month, day, hour, and minute, join the result back to A by these columns, and then replace NA in rain with the average rain value.
Note that after the OP updated and clarified the original question, headpoint's answer is simpler and more straightforward (https://stackoverflow.com/a/48313380/7669809). However, I am keeping my answer here as an example showing that, if the replacement values come from another data frame, the left_join approach would be useful.
library(dplyr)
A2 <- A %>%
  left_join(A %>%
              group_by(month, day, hour, minute) %>%
              summarise(mean_rain = mean(rain, na.rm = TRUE)),
            by = c("month", "day", "hour", "minute")) %>%
  mutate(rain = ifelse(is.na(rain), mean_rain, rain)) %>%
  select(-mean_rain)
A2
# year month day hour minute rain
# 1 1990 1 1 1 30 10
# 2 1991 1 1 1 30 21
# 3 1992 1 1 1 30 4
# 4 1993 1 1 1 30 6
# 5 1994 1 1 1 30 10
# 6 1995 1 1 1 30 23
# 7 1996 1 1 1 30 0
# 8 1997 1 1 1 30 0
# 9 1998 1 1 1 30 0
# 10 1999 1 1 1 30 6
# 11 2000 1 1 1 0 2
# 12 2000 1 1 1 15 2
# 13 2000 1 1 1 30 8
# 14 2000 1 1 1 45 3
# 15 2000 1 1 2 0 4
# 16 2000 1 1 2 15 5
DATA
A <- read.table(text = "year month day hour minute rain
1990 01 01 01 30 10
1991 01 01 01 30 21
1992 01 01 01 30 4
1993 01 01 01 30 6
1994 01 01 01 30 10
1995 01 01 01 30 23
1996 01 01 01 30 0
1997 01 01 01 30 0
1998 01 01 01 30 0
1999 01 01 01 30 6
2000 01 01 01 00 2
2000 01 01 01 15 2
2000 01 01 01 30 NA
2000 01 01 01 45 3
2000 01 01 02 00 4
2000 01 01 02 15 5",
header = TRUE, stringsAsFactors = FALSE)
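To illustrate the case mentioned above where the replacement values live in a separate data frame, here is a minimal sketch. The data frames obs and clim and their values are made up for illustration (they are not from the question); clim is assumed to hold one precomputed mean per month, day, hour, and minute.

```r
library(dplyr)

# Hypothetical observations with one missing value
obs <- data.frame(year = 2000, month = 1, day = 1, hour = 1,
                  minute = c(0, 15, 30, 45),
                  rain = c(2, 2, NA, 3))

# Hypothetical lookup table of per-slot means (assumed values)
clim <- data.frame(month = 1, day = 1, hour = 1,
                   minute = c(0, 15, 30, 45),
                   mean_rain = c(1, 2, 8, 4))

filled <- obs %>%
  left_join(clim, by = c("month", "day", "hour", "minute")) %>%
  # keep rain where it is present, otherwise fall back to the lookup value
  mutate(rain = coalesce(rain, mean_rain)) %>%
  select(-mean_rain)

filled
```

Only the row with the missing value changes; the observed values are left untouched.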
Upvotes: 2
Reputation: 16099
You can coalesce the rain column to the average (ignoring NA) by year, month, day:
library(data.table)
library(hutils)
setDT(A)[,
rain := coalesce(rain, mean(rain, na.rm = TRUE)),
keyby = c("year", "month", "day")]
There is a nearly identical function in dplyr if you don't want to install hutils, though hutils::coalesce is faster:
copy(A)[, `:=`(rain2, hutils::coalesce(rain, mean(rain, na.rm = TRUE))), keyby = c("year", "month", "day")]
copy(A)[, `:=`(rain2, dplyr::coalesce(rain, mean(rain, na.rm = TRUE))), keyby = c("year", "month", "day")]
min lq mean median uq max neval cld
107.1927 118.8372 132.8129 121.4076 128.0383 634.658 100 a
887.1115 921.6607 960.4542 943.0711 980.5143 1145.658 100 b
Generate sample data:
A <-
CJ(year = 1990:2000,
month = 1:12,
day = 1:31, # yeah I know not all months have 31 days; doesn't matter here
hour = 0:23,
minute = 15L * c(0:3))
A[, rain := NA_real_]
A[sample.int(nrow(A), size = 0.8 * nrow(A)), rain := rexp(0.8 * nrow(A), rate = 2)]
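If the goal is the average of the same time slot across years, as the updated question describes, the grouping columns would be month, day, hour, and minute instead. A hedged sketch of that variant, assuming a recent data.table that provides fcoalesce (so neither hutils nor dplyr is needed); the small table B is rebuilt from the question's 01:30 values and is not the benchmark data:

```r
library(data.table)

# The 01:30 slot on January 1st for 1990-2000, values taken from the question
B <- data.table(year = 1990:2000, month = 1L, day = 1L, hour = 1L, minute = 30L,
                rain = c(10, 21, 4, 6, 10, 23, 0, 0, 0, 6, NA))

# rain must be double here: fcoalesce expects both arguments to share a type,
# and mean() always returns a double
B[, rain := fcoalesce(rain, mean(rain, na.rm = TRUE)),
  by = .(month, day, hour, minute)]

B[year == 2000]
```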
Upvotes: 1
Reputation: 335
Here's a dplyr method that uses only the means for the same time in previous years. That is, unlike the solutions provided by headpoint and www, this will not use values for 01:30 on January 1st 2001 or 2002 (etc.) to calculate the mean used for 01:30 on January 1st 2000, but rather the mean rain value at 01:30 on January 1st over all the years before 2000.
First, we recreate the A data frame to work with. I load the tibble package only to use the tribble function to recreate your example data frame.
library(dplyr)
library(tibble)
A <- tribble(
~ year, ~month, ~day, ~hour, ~minute, ~rain,
1990, 01, 01, 01, 30, 10,
1991, 01, 01, 01, 30, 21,
1992, 01, 01, 01, 30, 4,
1993, 01, 01, 01, 30, 6,
1994, 01, 01, 01, 30, 10,
1995, 01, 01, 01, 30, 23,
1996, 01, 01, 01, 30, 0,
1997, 01, 01, 01, 30, 0,
1998, 01, 01, 01, 30, 0,
2000, 01, 01, 01, 00, 2,
2000, 01, 01, 01, 15, 2,
2000, 01, 01, 01, 30, NA,
2000, 01, 01, 01, 45, 3,
2000, 01, 01, 02, 00, 4,
2000, 01, 01, 02, 15, 5)
Then we group by month, day, hour and minute so that we can compute a running mean for that time of that day across previous years. The most recent running mean is then used if there's an NA value the following year.
A <- A %>%
  group_by(month, day, hour, minute) %>%
  mutate(running_mean = round(cummean(rain), 0),
         most_recent_mean = lag(running_mean),
         rain = if_else(is.na(rain), most_recent_mean, rain)) %>%
  select(-running_mean, -most_recent_mean)
And that should do the trick. The printed result is as follows:
# A tibble: 15 x 6
# Groups: month, day, hour, minute [6]
year month day hour minute rain
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1990 1 1 1 30 10
2 1991 1 1 1 30 21
3 1992 1 1 1 30 4
4 1993 1 1 1 30 6
5 1994 1 1 1 30 10
6 1995 1 1 1 30 23
7 1996 1 1 1 30 0
8 1997 1 1 1 30 0
9 1998 1 1 1 30 0
10 2000 1 1 1 0 2
11 2000 1 1 1 15 2
12 2000 1 1 1 30 8
13 2000 1 1 1 45 3
14 2000 1 1 2 0 4
15 2000 1 1 2 15 5
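One caveat: cummean propagates NA, so if the same time slot is missing in two different years, the later one stays NA with the code above. A possible variant, offered as a sketch rather than a drop-in replacement, builds the running mean from cumulative sums and counts of the observed values only. The tibble B and the helper columns n_seen, sum_seen and prev_mean are made up for illustration, and unlike the code above this variant does not round the mean.

```r
library(dplyr)

# Hypothetical slot that is missing in two consecutive years
B <- tibble(year = 1990:1993, month = 1, day = 1, hour = 1, minute = 30,
            rain = c(10, 20, NA, NA))

B_filled <- B %>%
  arrange(year) %>%                          # running values need chronological order
  group_by(month, day, hour, minute) %>%
  mutate(n_seen    = cumsum(!is.na(rain)),       # observed values so far
         sum_seen  = cumsum(coalesce(rain, 0)),  # their sum, counting NA as 0
         prev_mean = lag(sum_seen / n_seen),     # mean of observed values in earlier years
         rain      = if_else(is.na(rain), prev_mean, rain)) %>%
  ungroup() %>%
  select(-n_seen, -sum_seen, -prev_mean)

B_filled
```

Both missing years are filled with 15, the mean of the two observed earlier values. A slot whose first year is missing still stays NA, since there is nothing earlier to average.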
Upvotes: 1
Reputation: 6222
This might work. It groups by month, day, hour, and minute, and then replaces the NAs with the group mean.
library(dplyr)
A <- A %>%
  group_by(month, day, hour, minute) %>%
  mutate(rain = ifelse(is.na(rain),
                       mean(rain, na.rm = TRUE), rain))
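For comparison, the same grouped replacement can be sketched in base R with ave(), which avoids the dplyr dependency. The data frame B below is rebuilt from the question's 01:30 values, and the helper fill_with_mean is a name introduced here for illustration.

```r
# The 01:30 slot on January 1st for 1990-2000, values taken from the question
B <- data.frame(year = 1990:2000, month = 1, day = 1, hour = 1, minute = 30,
                rain = c(10, 21, 4, 6, 10, 23, 0, 0, 0, 6, NA))

# Replace the NAs within one group by that group's mean of the observed values
fill_with_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies the function within each month/day/hour/minute combination
B$rain <- ave(B$rain, B$month, B$day, B$hour, B$minute, FUN = fill_with_mean)

B[B$year == 2000, ]
```

With these values the year 2000 entry becomes 8, matching the expected result in the question.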
Upvotes: 2