Arthur Boari

Reputation: 11

I need to detect zero sequences and replace them with NAs

I am working with air quality data and need to detect runs of zeros (two or more consecutive zeros) and replace every element of such a run with NA. Isolated single zeros must stay in the data.

Here's an example of the data:

date    TEMP    PM10    O3 (ug/m3)
5/25/2012 18:00:00  23,8    55  6,30397494404564
5/25/2012 19:00:00  22,8    75  0
5/25/2012 20:00:00  19,8    75  1,99689085129112
5/25/2012 21:00:00  15,3    98  11,1542397707455
5/25/2012 22:00:00  16,2    64  2,02173552751248
5/25/2012 23:00:00  16,3    44  0
5/25/2012 0:00:00   17,1    65  0
5/26/2012 1:00:00   17,5    73  0
5/26/2012 2:00:00   17,2    62  0
5/26/2012 3:00:00   17,1    45  0
5/26/2012 4:00:00   17  37  0
5/26/2012 5:00:00   17,3    29  0
5/26/2012 6:00:00   17,2    50  0
5/26/2012 7:00:00   17,1    36  0
5/26/2012 8:00:00   17,1    43  0
5/26/2012 9:00:00   17,9    45  0
5/26/2012 10:00:00  19,5    72  0
5/26/2012 11:00:00  21,3    85  3,57609276547571
5/26/2012 12:00:00  22,3    81  12,8699598468684

So, here I am with a solution:

df$oz <- ifelse(df$`O3 (ug/m3)` == 0 & lag(df$`O3 (ug/m3)`) == 0, NA, df$`O3 (ug/m3)`)

date    TEMP    PM10    O3 (ug/m3)  oz
5/25/2012 18:00:00  23,8    55  6,30397494404564    6,30397494404564
5/25/2012 19:00:00  22,8    75  0   0
5/25/2012 20:00:00  19,8    75  1,99689085129112    1,99689085129112
5/25/2012 21:00:00  15,3    98  11,1542397707455    11,1542397707455
5/25/2012 22:00:00  16,2    64  2,02173552751248    2,02173552751248
5/25/2012 23:00:00  16,3    44  0   NA
5/25/2012 0:00:00   17,1    65  0   NA
5/26/2012 1:00:00   17,5    73  0   NA
5/26/2012 2:00:00   17,2    62  0   NA
5/26/2012 3:00:00   17,1    45  0   NA
5/26/2012 4:00:00   17  37  0   NA
5/26/2012 5:00:00   17,3    29  0   NA
5/26/2012 6:00:00   17,2    50  0   NA
5/26/2012 7:00:00   17,1    36  0   NA
5/26/2012 8:00:00   17,1    43  0   NA
5/26/2012 9:00:00   17,9    45  0   NA
5/26/2012 10:00:00  19,5    72  0   NA
5/26/2012 11:00:00  21,3    85  3,57609276547571    3,57609276547571
5/26/2012 12:00:00  22,3    81  12,8699598468684    12,8699598468684

Upvotes: 1

Views: 51

Answers (1)

Spacedman

Reputation: 94267

This function works by computing the run lengths of consecutive zeroes, then replacing any runs longer than 1 with NA in the original vector:

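NArun0 = function(v){
   # run-length encode the "is zero" indicator
   vr = rle(v == 0)
   # keep only runs that are zeroes AND longer than one element
   vr$values = vr$lengths > 1 & vr$values
   # expand back to a logical mask the same length as v
   rv = inverse.rle(vr)
   v[rv] = NA
   v
}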
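
To make the intermediate steps visible, here is a small sketch that walks one of the test vectors through the same calls by hand. The variable names (runs, long_zero, mask) are only for illustration and are not part of the function above.

```r
v <- c(0, 1, 0, 0, 2, 0)

# Run-length encode the "is zero" indicator: one entry per run
runs <- rle(v == 0)
run_lengths <- runs$lengths   # how long each run is
run_is_zero <- runs$values    # whether the run is a run of zeroes

# A run qualifies only if it is made of zeroes and has more than one element
long_zero <- run_lengths > 1 & run_is_zero

# Swap the qualifying flags in and expand back to one flag per element of v
runs$values <- long_zero
mask <- inverse.rle(runs)

v[mask] <- NA
v
```

Only the middle pair of zeroes is flagged, which matches the NArun0(c(0,1,0,0,2,0)) test.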

And a bunch of simple tests:

> NArun0(c(0,0,0,0))
[1] NA NA NA NA
> NArun0(c(0,1,1,0))
[1] 0 1 1 0
> NArun0(c(0,1,0,2,0))
[1] 0 1 0 2 0
> NArun0(c(0,1,0,0,2,0))
[1]  0  1 NA NA  2  0
> NArun0(c(0,1,0,0,2,0,0))
[1]  0  1 NA NA  2 NA NA
> NArun0(c(0,0,1,0,0,2,0,0))
[1] NA NA  1 NA NA  2 NA NA
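
Applied to the question's data, it might look like the sketch below. It assumes the O3 values arrive as text with decimal commas, as in the sample; the small data frame is rebuilt by hand from the first few rows and is not the asker's real df.

```r
NArun0 <- function(v) {
  vr <- rle(v == 0)
  vr$values <- vr$lengths > 1 & vr$values
  v[inverse.rle(vr)] <- NA
  v
}

# First eight O3 values from the question, kept as text with decimal commas
o3_text <- c("6,30397494404564", "0", "1,99689085129112", "11,1542397707455",
             "2,02173552751248", "0", "0", "0")

# Convert the decimal commas to points before turning the text into numbers
df <- data.frame(`O3 (ug/m3)` = as.numeric(sub(",", ".", o3_text, fixed = TRUE)),
                 check.names = FALSE)

# New column: runs of zeroes become NA, the solo zero in row 2 is kept
df$oz <- NArun0(df$`O3 (ug/m3)`)
df
```

If the column is already numeric, the sub() step is unnecessary and the last assignment is all that is needed.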

Note that the code given in the question returns NA for all zeroes for me, including the solo one:

> v = c(1,0,2,3,4,0,0,0,0,9)
> ifelse(v==0 & lag(v)==0, NA, v)
 [1]  1 NA  2  3  4 NA NA NA NA  9

I thought maybe it was because lag was operating on a time series vector, but if I convert v to a time series vector:

> v = ts(v)
> ifelse(v==0 & lag(v)==0, NA, v)
Time Series:
Start = 1 
End = 9 
Frequency = 1 
[1]  1  0  2  3  4 NA NA NA  0

Now the solo zero is kept, but so is the last zero of the run (and the result is one element shorter), because of the way lag shifts the time base. So I don't understand how the code in the question produces the output shown.
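
For what it's worth, on a plain vector stats::lag() does not move the values at all; it only attaches a "tsp" attribute, which is why the first attempt matches every zero. A comparison against the previous element alone also cannot catch the first zero of a run. A rough sketch of a shift-based alternative, checking both neighbours, is below. The helper name na_zero_runs is made up for this example, and this is a different technique from the rle approach above, not a drop-in for the question's one-liner.

```r
v <- c(1, 0, 2, 3, 4, 0, 0, 0, 0, 9)

# stats::lag() on a plain vector keeps the values in place,
# so lag(v) == 0 is TRUE exactly where v == 0 is TRUE
same_values <- identical(as.vector(stats::lag(v)), v)

# Hypothetical helper: shift the zero indicator by hand in both directions
na_zero_runs <- function(v) {
  is_zero   <- !is.na(v) & v == 0
  prev_zero <- c(FALSE, head(is_zero, -1))   # was the previous element zero?
  next_zero <- c(tail(is_zero, -1), FALSE)   # is the next element zero?
  # a zero belongs to a run if either neighbour is also zero
  v[is_zero & (prev_zero | next_zero)] <- NA
  v
}

result <- na_zero_runs(v)
result
```

On this vector the solo zero in position 2 stays and the four consecutive zeroes become NA, the same as NArun0 gives.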

Upvotes: 1
