RDB57
RDB57

Reputation: 11

computing means in time series data from non-zero entries

I have a data frame of 400+ tag names with daily time series data for each tag. The values of some of tags are 0 where I do not have readings. I would like to calculate a mean of the non-zero amounts for each tag in the time series and fill the zero values with that mean.

example

tag1   day1   400  
tag1   day2   200  
tag1   day3   0  
.  
.  
tag1  dayn    0  
tag2  day1    0  
tag2  day2    100  
tag2  day3    0  
...  

Here I would like to fill the 0 values for tag2 with 100 and the 0 values for tag1 with 300

I can use ddply to subset the data frame and calculate means but I am looking for the best way to get means based on the non-zero entries for each tag and then fill the original zero values in the data frame with the mean values per tag. There seems to be a few ways to do this a bunch of lines of code but I suspect there much be a faster/more elegant approach. There are 400-500 tags in the data and about 150 daily readings

Upvotes: 1

Views: 96

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269694

Here are some approaches:

1) sqldf This was moved from comments. The code below selects tag, day and one of two values for Amount. For each row, if Amount is 0 then it runs the inner correlated select and otherwise it uses the Amount value.

library(sqldf)
sqldf("select 
           tag, 
           day, 
           case when a.Amount = 0 
                then (select avg(b.Amount) 
                      from df1 b 
                      where b.Amount != 0 and b.tag = a.tag) 
                else a.Amount 
                end Amount 
           from df1 a") 

giving:

   tag  day Amount
1 tag1 day1    400
2 tag1 day2    200
3 tag1 day3    300
4 tag1 dayn    300
5 tag2 day1    100
6 tag2 day2    100
7 tag2 day3    100

2) na.aggregate Replace the zero values with NA and then apply na.aggregate from zoo to each group using ave:

library(zoo)
transform(df1, Amount = ave(replace(Amount, Amount == 0, NA), tag, FUN = na.aggregate))

Note We used the following as input:

df1 <- structure(list(tag = c("tag1", "tag1", "tag1", "tag1", "tag2", 
"tag2", "tag2"), day = c("day1", "day2", "day3", "dayn", "day1", 
"day2", "day3"), Amount = c(400L, 200L, 0L, 0L, 0L, 100L, 0L)), .Names = c("tag", 
"day", "Amount"), class = "data.frame", row.names = c(NA, -7L))

Update: Added (2).

Upvotes: 2

akrun
akrun

Reputation: 887213

For large datasets, it may be efficient to use either data.table or dplyr.

Using data.table, we convert the 'data.frame' to 'data.table' (setDT(df1)), to avoid a clash for possible mismatch between the class of 'Amount' and the mean value (which would be 'numeric' class), we can first change 'Amount' to 'numeric' class (Amount := as.numeric(Amount)), create a 'Mean' column by getting the mean of all non-zero 'Amount' values grouped by 'tag', replace the values that are '0' for Amount (Amount==0) with the corresponding value from 'Mean' column (Amount := Mean), and if needed, we can remove the 'Mean' column by assigning to 'NULL'

 library(data.table)
 setDT(df1)[, Amount:= as.numeric(Amount)
   ][, Mean:= mean(Amount[Amount!=0]), by= tag
   ][Amount==0, Amount := Mean][, Mean:= NULL]

Or using dplyr, we can group by 'tag' and use replace to change the '0' values to mean

 library(dplyr)
 df1 %>%
     group_by(tag)%>%
     mutate(Amount= replace(Amount, which(Amount==0),
         mean(Amount[Amount!=0])))

Or a possible sqldf solution

 library(sqldf)
 res1 <- sqldf("select * from df1 
           left join (select tag,
                     avg(Amount) as Mean 
                     from df1 
                     where Amount is not 0
                     group by tag) 
                      using (tag)")
  sqldf("select tag, day,
          case when Amount like 0 
               then Mean
               else Amount 
               end Amount
         from res1")

Update

As mentioned by @Frank (in the comments), if a particular 'tag' group have only 0 as 'Amount', then the previous solution will get 'NaN' for the mean. I am not sure what the expected value will be in that case. Suppose if we need to keep that as 0, the change in code (with inclusion of na.rm=TRUE for the possible NA values)

  df2 <- rbind(df1,list("tag3","day3",0))
  setDT(df2)[, Amount := as.numeric(Amount)
  ][,  Mean:= if(all(Amount==0)) 0 
              else mean(Amount[Amount!=0], na.rm=TRUE), by = tag
  ][Amount==0, Amount:= Mean][, Mean:= NULL]

data

df1 <- structure(list(tag = c("tag1", "tag1", "tag1", "tag1", "tag2", 
"tag2", "tag2"), day = c("day1", "day2", "day3", "dayn", "day1", 
"day2", "day3"), Amount = c(400L, 200L, 0L, 0L, 0L, 100L, 0L)),
.Names = c("tag", 
"day", "Amount"), class = "data.frame", row.names = c(NA, -7L))

Upvotes: 2

bgoldst
bgoldst

Reputation: 35314

Here's a possible solution using ave():

set.seed(2); NT <- 4; ND <- 4; df <- data.frame(tag=rep(paste0('tag',1:NT),each=ND),day=rep(paste0('day',1:ND),NT),amount=c(sample(seq(0,400,100),ND*(NT-1),replace=T),rep(0,ND)));
df;
##     tag  day amount
## 1  tag1 day1      0
## 2  tag1 day2    300
## 3  tag1 day3    200
## 4  tag1 day4      0
## 5  tag2 day1    400
## 6  tag2 day2    400
## 7  tag2 day3      0
## 8  tag2 day4    400
## 9  tag3 day1    200
## 10 tag3 day2    200
## 11 tag3 day3    200
## 12 tag3 day4    100
## 13 tag4 day1      0
## 14 tag4 day2      0
## 15 tag4 day3      0
## 16 tag4 day4      0
df$amount[df$amount==0] <- NA;
df$amount[is.na(df$amount)] <- ave(df$amount,df$tag,FUN=function(x) mean(x,na.rm=T))[is.na(df$amount)];
df;
##     tag  day amount
## 1  tag1 day1    250
## 2  tag1 day2    300
## 3  tag1 day3    200
## 4  tag1 day4    250
## 5  tag2 day1    400
## 6  tag2 day2    400
## 7  tag2 day3    400
## 8  tag2 day4    400
## 9  tag3 day1    200
## 10 tag3 day2    200
## 11 tag3 day3    200
## 12 tag3 day4    100
## 13 tag4 day1    NaN
## 14 tag4 day2    NaN
## 15 tag4 day3    NaN
## 16 tag4 day4    NaN

Frank's comment (on akrun's post) about NaNs applies here as well. If you don't want NaNs, I would just replace them with zero or NA or whatever default value you want after-the-fact, e.g. df$amount[is.nan(df$amount)] <- NA;.

Upvotes: 0

Related Questions