Jan Blanke

Reputation: 181

Data aggregation loop in R

I am having trouble aggregating my data to daily values. I have a data frame from which NAs have been removed (a link to a picture of the data is given below). Data were collected 3 times a day, but because of the removed NAs some days have only 1 or 2 entries, and some days are missing completely.

I would now like to calculate the daily mean of "dist", that is, sum the "dist" values for one day and divide by the number of entries for that day (3 if no data are missing that day). I would like to do this with a loop, but the number of entries per day varies: sometimes 3, sometimes just 2 or even 1. How can I tell R to sum "dist" for every day and divide it by the number of entries available for that day?

I just have no idea how to formulate a for loop for this purpose. I would really appreciate any advice on this problem. Thanks for your efforts and kind regards,

Jan

Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html

Edit: I used aggregate and tapply as suggested; however, the daily mean was not calculated. I got one value per timestamp instead:

              Group.1         x
1  2006-10-06 12:00:00  636.5395
2  2006-10-06 20:00:00  859.0109
3  2006-10-07 04:00:00  301.8548
4  2006-10-07 12:00:00  649.3357
5  2006-10-07 20:00:00  944.8272
6  2006-10-08 04:00:00  136.7393
7  2006-10-08 12:00:00  360.9560
8  2006-10-08 20:00:00       NaN

The code used was:

dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)

Upvotes: 1

Views: 1582

Answers (3)

Peter M

Reputation: 844

It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like

Dis_sub$date_only <- as.Date(Dis_sub$date)

Joris Meys' solution (which is the right way to do it) should then work, provided you group by date_only instead of date.

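However, if for some reason you really want to use a loop, you could try something like

days <- unique(Dis_sub$date_only)
newFrame <- data.frame()
for (i in seq_along(days)) {
    d <- days[i]  # index into days: for (d in days) would drop the Date class
    meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
    newFrame <- rbind(newFrame, data.frame(date = d, meanDist = meanDist))
}

But keep in mind that this will be slow and memory-inefficient, because rbind copies the whole data frame on every iteration.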
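For comparison, here is a minimal sketch of the non-loop route on timestamped data. The Dis_sub built here is a made-up stand-in with invented values, not the asker's data, and the fixed UTC time zone is an assumption chosen so that the date conversion is unambiguous.

```r
# Hypothetical stand-in for Dis_sub: up to three readings a day, some missing
Dis_sub <- data.frame(
  date = as.POSIXct(c("2006-10-06 12:00:00", "2006-10-06 20:00:00",
                      "2006-10-07 04:00:00", "2006-10-07 12:00:00",
                      "2006-10-07 20:00:00", "2006-10-08 04:00:00"),
                    tz = "UTC"),
  dist = c(600, 800, 300, 600, 900, 100)
)

# Drop the time of day so that all readings from one day share a key
Dis_sub$date_only <- as.Date(Dis_sub$date, tz = "UTC")

# One row per day; mean() divides by however many readings that day has
daily <- aggregate(dist ~ date_only, data = Dis_sub, FUN = mean)
print(daily)
```

Grouping on date_only rather than date is what makes the difference: with the full timestamp, every reading forms its own group and the "mean" is just the original value.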

Upvotes: 1

Ramnath

Reputation: 55695

Look at the data.table package, especially if your data is huge. Here is some code that calculates the mean of dist by day.

library(data.table)
dt <- data.table(Data)
# call [ on the data.table (dt), not on the original data frame
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = "date"]

Upvotes: 2

Joris Meys

Reputation: 108533

Don't use a loop. Use R. Some example data:

dates <- rep(seq(as.Date("2001-01-05"),
                 as.Date("2001-01-20"),
                 by="day"),
             each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA

and either of:

aggregate(values,list(dates),mean,na.rm=TRUE)

tapply(values,dates,mean,na.rm=TRUE)

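gives you what you want. See also ?aggregate and ?tapply. Pass na.rm=TRUE so that missing values are skipped and each mean is divided by the number of available entries only.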

If you want a data frame back, you can look at the package plyr:

Data <- data.frame(dates, values)
library(plyr)

# summarise computes the mean within each group of rows sharing a date
ddply(Data, "dates", summarise, mean_value = mean(values, na.rm = TRUE))

Keep in mind that ddply does not fully support the date format (yet).

Upvotes: 6
