Felipe Rincón
Felipe Rincón

Reputation: 101

How to replace NA data of specific dates with the average data of different years of same dates of a dataframe in R?

If I have a dataframe A

A = 
year     month     day     hour     minute     rain
                        .
                        .
                        .
2000      01        01      01       00          2
2000      01        01      01       15          2
2000      01        01      01       30          NA
2000      01        01      01       45          3
2000      01        01      02       00          4
2000      01        01      02       15          5
                        .
                        .
                        .

Dataframe A have data from 1990 to 2000 with a frequency data of 15 minutes.

so A have the same dates and different years of specific missing data (NA):

A = 
year     month     day     hour     minute     rain
1990      01        01      01       30          10
                        .
                        .
                        .
1991      01        01      01       30          21
                        .
                        .
                        .
1992      01        01      01       30          4
                        .
                        .
                        .
1993      01        01      01       30          6
                        .
                        .
                        .
1994      01        01      01       30          10
                        .
                        .
                        .
1995      01        01      01       30          23
                        .
                        .
                        .
1996      01        01      01       30          0
                        .
                        .
                        .
1997      01        01      01       30          0
                        .
                        .
                        .
1998      01        01      01       30          0
                        .
                        .
                        .
1999      01        01      01       30          6
                        .
                        .
                        .
2000      01        01      01       30          NA

The idea is to look for every NA data in Rain column and replace them with the average of the same date of each year from 1990 to 2000.

The new A would be for 2000 year:

A = 
year     month     day     hour     minute     rain
                        .
                        .
                        .
2000      01        01      01       00          2
2000      01        01      01       15          2
2000      01        01      01       30          **8**
2000      01        01      01       45          3
2000      01        01      02       00          4
2000      01        01      02       15          5
                        .
                        .
                        .

Upvotes: 2

Views: 1050

Answers (4)

www
www

Reputation: 39174

A solution using . The idea is to summarize the rain in A by month, day, hour, and minute, join by these columns, and then replace NA in rain with the average rain value.

Notice that after the OP updated and clarified the original question, headpoint's answer is more simple and straightforward (https://stackoverflow.com/a/48313380/7669809). However, I still keep my answer here as an example to show that if the replacement values are from another data frame, the left_join approach would be useful.

library(dplyr)

A2 <- A %>%
  left_join(A %>%
              group_by(month, day, hour, minute) %>%
              summarise(mean_rain = mean(rain, na.rm = TRUE)), 
            by = c("month", "day", "hour", "minute")) %>%
  mutate(rain = ifelse(is.na(rain), mean_rain, rain)) %>%
  select(-mean_rain)
A2
#    year month day hour minute rain
# 1  1990     1   1    1     30   10
# 2  1991     1   1    1     30   21
# 3  1992     1   1    1     30    4
# 4  1993     1   1    1     30    6
# 5  1994     1   1    1     30   10
# 6  1995     1   1    1     30   23
# 7  1996     1   1    1     30    0
# 8  1997     1   1    1     30    0
# 9  1998     1   1    1     30    0
# 10 1999     1   1    1     30    6
# 11 2000     1   1    1      0    2
# 12 2000     1   1    1     15    2
# 13 2000     1   1    1     30    8
# 14 2000     1   1    1     45    3
# 15 2000     1   1    2      0    4
# 16 2000     1   1    2     15    5

DATA

A <- read.table(text =  "year     month     day     hour     minute     rain
1990      01        01      01       30          10
1991      01        01      01       30          21
1992      01        01      01       30          4
1993      01        01      01       30          6
1994      01        01      01       30          10
1995      01        01      01       30          23
1996      01        01      01       30          0
1997      01        01      01       30          0
1998      01        01      01       30          0
1999      01        01      01       30          6
2000      01        01      01       00          2
2000      01        01      01       15          2
2000      01        01      01       30          NA
2000      01        01      01       45          3
2000      01        01      02       00          4
2000      01        01      02       15          5",
                header = TRUE, stringsAsFactors = FALSE)

Upvotes: 2

Hugh
Hugh

Reputation: 16099

You can coalesce the rain column to the average (without NA) by year, month, day:

library(data.table)
library(hutils)

setDT(A)[, 
         rain := coalesce(rain, mean(rain, na.rm = TRUE)),
         keyby = c("year", "month", "day")]

There is a nearly identical function from dplyr if you don't want to install hutils, though hutils::coalesce is faster:

 copy(A)[, `:=`(rain2, hutils::coalesce(rain, mean(rain, na.rm = TRUE))),      keyby = c("year", "month", "day")]
  copy(A)[, `:=`(rain2, dplyr::coalesce(rain, mean(rain, na.rm = TRUE))),      keyby = c("year", "month", "day")]
      min       lq     mean   median       uq      max neval cld
 107.1927 118.8372 132.8129 121.4076 128.0383  634.658   100  a 
 887.1115 921.6607 960.4542 943.0711 980.5143 1145.658   100   b

Generate sample file:

A <- 
  CJ(year = 1990:2000,
     month = 1:12,
     day = 1:31,  # yeah I know not all months have 31 days; doesn't matter here
     hour = 0:23,
     minute = 15L * c(0:3))
A[, rain := NA_real_]
A[sample.int(nrow(A), size = 0.8 * nrow(A)), rain := rexp(0.8 * nrow(A), rate = 2)]

Upvotes: 1

Hlynur
Hlynur

Reputation: 335

Here's a dplyr method that uses only the means from the same time from previous years. That is, as opposed to solutions provided by headpoint and www, this will not use values for 01:30, January 1st 2001 or 2002 (etc) to calculate the mean used for 01:30, January 1st 2000, but rather the mean rain value on 01:30, January 1st for all the years before 2000.

So first, we simply recreate the A dataframe for which to work with. I only load the tibble package to use the tribble function to recreate your dataframe example.

library(dplyr)
library(tibble)

A <- tribble(
~ year, ~month, ~day, ~hour, ~minute, ~rain,
1990,   01,     01,   01,    30,      10,
1991,   01,     01,   01,    30,      21,
1992,   01,     01,   01,    30,      4,
1993,   01,     01,   01,    30,      6,
1994,   01,     01,   01,    30,      10,
1995,   01,     01,   01,    30,      23,
1996,   01,     01,   01,    30,      0,
1997,   01,     01,   01,    30,      0,
1998,   01,     01,   01,    30,      0,
2000,   01,     01,   01,    00,      2,
2000,   01,     01,   01,    15,      2,
2000,   01,     01,   01,    30,      NA,
2000,   01,     01,   01,    45,      3,
2000,   01,     01,   02,    00,      4,
2000,   01,     01,   02,    15,      5)

And then what we do is group by month, day, hour and minute so that we can create a running mean for that time of that day in previous years. The most recent of which will then be used if there's an NA value the following year.

A <- A %>% 
  group_by(month, day, hour, minute) %>% 
  mutate(running_mean = round(cummean(rain), 0),
         most_recent_mean = lag(running_mean),
         rain = if_else(is.na(rain), most_recent_mean, rain)) %>%
  select(-running_mean, -most_recent_mean)

And that should do the trick. The printed result is as follows:

# A tibble: 15 x 6
# Groups:   month, day, hour, minute [6]
    year month   day  hour minute  rain
   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1  1990     1     1     1     30    10
 2  1991     1     1     1     30    21
 3  1992     1     1     1     30     4
 4  1993     1     1     1     30     6
 5  1994     1     1     1     30    10
 6  1995     1     1     1     30    23
 7  1996     1     1     1     30     0
 8  1997     1     1     1     30     0
 9  1998     1     1     1     30     0
10  2000     1     1     1      0     2
11  2000     1     1     1     15     2
12  2000     1     1     1     30     8
13  2000     1     1     1     45     3
14  2000     1     1     2      0     4
15  2000     1     1     2     15     5 

Upvotes: 1

kangaroo_cliff
kangaroo_cliff

Reputation: 6222

This might work. It groups the months and days pairs and the replace the NAs from the mean.

library(dplyr)
A <- A %>% 
       group_by(month, day, hour, minute) %>% 
       mutate(rain = ifelse(is.na(rain), 
                              mean(rain, na.rm=TRUE), rain))

Upvotes: 2

Related Questions