Reputation: 11
I have a data frame of 400+
tag names with daily time series data for each tag. The values of some of tags are 0 where I do not have readings. I would like to calculate a mean of the non-zero amounts for each tag in the time series and fill the zero values with that mean.
example
tag1 day1 400
tag1 day2 200
tag1 day3 0
.
.
tag1 dayn 0
tag2 day1 0
tag2 day2 100
tag2 day3 0
...
Here I would like to fill the 0 values for tag2 with 100 and the 0 values for tag1 with 300
I can use ddply to subset the data frame and calculate means but I am looking for the best way to get means based on the non-zero entries for each tag and then fill the original zero values in the data frame with the mean values per tag. There seems to be a few ways to do this a bunch of lines of code but I suspect there much be a faster/more elegant approach. There are 400-500 tags in the data and about 150 daily readings
Upvotes: 1
Views: 96
Reputation: 269694
Here are some approaches:
1) sqldf This was moved from comments. The code below selects tag
, day
and one of two values for Amount
. For each row, if Amount
is 0 then it runs the inner correlated select and otherwise it uses the Amount
value.
library(sqldf)
sqldf("select
tag,
day,
case when a.Amount = 0
then (select avg(b.Amount)
from df1 b
where b.Amount != 0 and b.tag = a.tag)
else a.Amount
end Amount
from df1 a")
giving:
tag day Amount
1 tag1 day1 400
2 tag1 day2 200
3 tag1 day3 300
4 tag1 dayn 300
5 tag2 day1 100
6 tag2 day2 100
7 tag2 day3 100
2) na.aggregate Replace the zero values with NA and then apply na.aggregate
from zoo to each group using ave
:
library(zoo)
transform(df1, Amount = ave(replace(Amount, Amount == 0, NA), tag, FUN = na.aggregate))
Note We used the following as input:
df1 <- structure(list(tag = c("tag1", "tag1", "tag1", "tag1", "tag2",
"tag2", "tag2"), day = c("day1", "day2", "day3", "dayn", "day1",
"day2", "day3"), Amount = c(400L, 200L, 0L, 0L, 0L, 100L, 0L)), .Names = c("tag",
"day", "Amount"), class = "data.frame", row.names = c(NA, -7L))
Update: Added (2).
Upvotes: 2
Reputation: 887213
For large datasets, it may be efficient to use either data.table
or dplyr
.
Using data.table
, we convert the 'data.frame' to 'data.table' (setDT(df1)
), to avoid a clash for possible mismatch between the class
of 'Amount' and the mean
value (which would be 'numeric' class), we can first change 'Amount' to 'numeric' class (Amount := as.numeric(Amount)
), create a 'Mean' column by getting the mean
of all non-zero 'Amount' values grouped by 'tag', replace the values that are '0' for Amount (Amount==0
) with the corresponding value from 'Mean' column (Amount := Mean
), and if needed, we can remove the 'Mean' column by assigning to 'NULL'
library(data.table)
setDT(df1)[, Amount:= as.numeric(Amount)
][, Mean:= mean(Amount[Amount!=0]), by= tag
][Amount==0, Amount := Mean][, Mean:= NULL]
Or using dplyr
, we can group by 'tag' and use replace
to change the '0' values to mean
library(dplyr)
df1 %>%
group_by(tag)%>%
mutate(Amount= replace(Amount, which(Amount==0),
mean(Amount[Amount!=0])))
Or a possible sqldf
solution
library(sqldf)
res1 <- sqldf("select * from df1
left join (select tag,
avg(Amount) as Mean
from df1
where Amount is not 0
group by tag)
using (tag)")
sqldf("select tag, day,
case when Amount like 0
then Mean
else Amount
end Amount
from res1")
As mentioned by @Frank (in the comments), if a particular 'tag' group have only 0 as 'Amount', then the previous solution will get 'NaN' for the mean
. I am not sure what the expected value will be in that case. Suppose if we need to keep that as 0, the change in code (with inclusion of na.rm=TRUE
for the possible NA values)
df2 <- rbind(df1,list("tag3","day3",0))
setDT(df2)[, Amount := as.numeric(Amount)
][, Mean:= if(all(Amount==0)) 0
else mean(Amount[Amount!=0], na.rm=TRUE), by = tag
][Amount==0, Amount:= Mean][, Mean:= NULL]
df1 <- structure(list(tag = c("tag1", "tag1", "tag1", "tag1", "tag2",
"tag2", "tag2"), day = c("day1", "day2", "day3", "dayn", "day1",
"day2", "day3"), Amount = c(400L, 200L, 0L, 0L, 0L, 100L, 0L)),
.Names = c("tag",
"day", "Amount"), class = "data.frame", row.names = c(NA, -7L))
Upvotes: 2
Reputation: 35314
Here's a possible solution using ave()
:
set.seed(2); NT <- 4; ND <- 4; df <- data.frame(tag=rep(paste0('tag',1:NT),each=ND),day=rep(paste0('day',1:ND),NT),amount=c(sample(seq(0,400,100),ND*(NT-1),replace=T),rep(0,ND)));
df;
## tag day amount
## 1 tag1 day1 0
## 2 tag1 day2 300
## 3 tag1 day3 200
## 4 tag1 day4 0
## 5 tag2 day1 400
## 6 tag2 day2 400
## 7 tag2 day3 0
## 8 tag2 day4 400
## 9 tag3 day1 200
## 10 tag3 day2 200
## 11 tag3 day3 200
## 12 tag3 day4 100
## 13 tag4 day1 0
## 14 tag4 day2 0
## 15 tag4 day3 0
## 16 tag4 day4 0
df$amount[df$amount==0] <- NA;
df$amount[is.na(df$amount)] <- ave(df$amount,df$tag,FUN=function(x) mean(x,na.rm=T))[is.na(df$amount)];
df;
## tag day amount
## 1 tag1 day1 250
## 2 tag1 day2 300
## 3 tag1 day3 200
## 4 tag1 day4 250
## 5 tag2 day1 400
## 6 tag2 day2 400
## 7 tag2 day3 400
## 8 tag2 day4 400
## 9 tag3 day1 200
## 10 tag3 day2 200
## 11 tag3 day3 200
## 12 tag3 day4 100
## 13 tag4 day1 NaN
## 14 tag4 day2 NaN
## 15 tag4 day3 NaN
## 16 tag4 day4 NaN
Frank's comment (on akrun's post) about NaNs applies here as well. If you don't want NaNs, I would just replace them with zero or NA or whatever default value you want after-the-fact, e.g. df$amount[is.nan(df$amount)] <- NA;
.
Upvotes: 0