Brant Mullinix

Reputation: 137

Replacing NA values with the mean value aggregated over an interval

I have two data frames. The first is the full data set, which includes NA values for the 'steps' variable. It has three variables: steps, date, and interval (a five-minute interval within the day, with values 0-2355 increasing by 5). The second data frame holds the mean value of steps for each interval. To reproduce the data frames, use the following code:

#dat <- read.csv("activity.csv")
dat <- data.frame(steps = c(NA, 16, 5, 3, 8, NA),
                  date = c("2012-10-01", "2012-10-01", "2012-10-02",
                           "2012-10-02", "2012-10-03", "2012-10-03"),
                  interval = c(0, 5, 0, 5, 0, 5))
dat$date <- as.Date(dat$date, format='%Y-%m-%d')
steps_by_interval_df <- aggregate(steps ~ interval, dat[complete.cases(dat),], mean)

What I would like to do now is replace the NA values in dat with the mean steps calculated in steps_by_interval_df, so I did the following:

missing_steps_vect <- is.na(dat$steps)
dat$steps[missing_steps_vect] <- 
  steps_by_interval_df$steps[
    which(dat$interval[missing_steps_vect] == steps_by_interval_df$interval)]

This part works! All of the NA values are replaced by the mean that I calculated for that interval. This was my proof of concept to myself so that I could make sure the function I wrote works as planned.

The problem is that if I replace the first line of code with my actual read.csv call (see the commented-out line), not all of the NA values are replaced. It only seems to replace the first chunk of NA values. I start with about 2300 NA values; after running the code I still have about 2100, where I would expect 0. Why does the code work for the data frame I created but not for the one I get from read.csv?

If you plan to recreate the problem, you will need to unzip the file from here and point read.csv at the extracted csv file.

Disclaimer: This is for a class I am taking. I could probably do this with a for loop just to make it work, but I would prefer to learn why this does not work instead of just doing something different.

Thanks.

Upvotes: 2

Views: 256

Answers (1)

Jaap

Reputation: 83275

Using ave in combination with na.aggregate from the zoo package will give you the desired result and is a lot easier than creating a separate function:

library(zoo)
dat <- read.csv("activity.csv")
dat$date <- as.Date(dat$date, format='%Y-%m-%d')
dat$steps <- ave(dat$steps, dat$interval, FUN=na.aggregate)
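
As for why the original code stops working on the full file, here is a minimal sketch on made-up data (the eight-row data frame below is an assumption for illustration, not the real activity.csv). `==` compares two vectors element by element and recycles the shorter one, so it is not a lookup: `which()` then returns positions in the vector of missing rows, and any position past the end of steps_by_interval_df indexes to NA. In the six-row example the two vectors happen to have the same length and order, which is why it appeared to work. `match()` does the per-row lookup that was intended:

```r
# Toy data, made up for illustration: two intervals over four days,
# with the first and last day entirely missing.
dat <- data.frame(
  steps    = c(NA, NA, 5, 3, 8, 16, NA, NA),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02",
                           "2012-10-03", "2012-10-04"), each = 2)),
  interval = rep(c(0, 5), times = 4)
)

# Mean steps per interval; the formula method drops NA rows by default.
steps_by_interval_df <- aggregate(steps ~ interval, data = dat, FUN = mean)

missing <- is.na(dat$steps)

# Approach from the question: `==` recycles the two intervals against the
# four missing rows, so which() returns 1:4, and positions 3 and 4 do not
# exist in steps_by_interval_df$steps (they index to NA).
idx_eq <- which(dat$interval[missing] == steps_by_interval_df$interval)
filled_eq <- dat$steps
filled_eq[missing] <- steps_by_interval_df$steps[idx_eq]

# Lookup with match(): for each missing row, find the row of
# steps_by_interval_df that has the same interval.
idx_match <- match(dat$interval[missing], steps_by_interval_df$interval)
filled_match <- dat$steps
filled_match[missing] <- steps_by_interval_df$steps[idx_match]

sum(is.na(filled_eq))     # NAs left by the `==` approach
sum(is.na(filled_match))  # NAs left by the match() approach
```

On this toy data the `==` version fills only the first missing day and leaves the later one as NA, while the `match()` version fills every missing row with its interval mean.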

Upvotes: 1
