Reputation: 8494
I have a simple data frame:
a <- c("06/12/2012 06:00","06/12/2012 06:05","06/12/2012 06:10","06/12/2012 06:15","06/12/2012 06:20","06/12/2012 06:25",
"06/12/2012 06:30","06/12/2012 06:35","06/12/2012 06:40","06/12/2012 06:45","06/12/2012 06:50","06/12/2012 06:55",
"06/12/2012 07:00","06/12/2012 07:05","06/12/2012 07:10","06/12/2012 07:15","06/12/2012 07:20","06/12/2012 07:25",
"06/12/2012 07:30","06/12/2012 07:35","06/12/2012 07:40","06/12/2012 07:45","06/12/2012 07:50","06/12/2012 07:55",
"06/12/2012 08:00")
a <- strptime(a, "%d/%m/%Y %H:%M")
b <- c("1","0","0","0","2","0","0","0","3","0","0","0","0","0","1","2","5","6","0","0","0","0","6","10","2")
df1 <- data.frame(a,b)
I want to use R to delete parts of my data frame where there is insufficient valid data. Data are recorded every 5 minutes. If the 'b' column contains only zeros for 20 minutes or more of continuous data, those rows can be deleted from my final data frame.
If anyone has any ideas to help me, I would very much appreciate it.
Upvotes: 2
Views: 81
Reputation: 118869
One solution using rle (as Ben mentions under comments):
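# run-length encode b (converted from character/factor to numeric)
r <- rle(as.numeric(as.character(df1$b)))
# NOTE: this assumes all rows are at 5-minute intervals!!
# So a run of zeros of length >= 4 counts as a >= 20 minute interval
p <- which(r$values == 0 & r$lengths >= 4)
# end row of each run; the leading 0 handles a zero run starting at row 1
w <- c(0, cumsum(r$lengths))
o <- unlist(lapply(p, function(x) (w[x] + 1):w[x + 1]))
# guard: with no qualifying run, df1[-o, ] would return zero rows
if (length(o) > 0) df1[-o, ] else df1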
# a b
# 1 2012-12-06 06:00:00 1
# 2 2012-12-06 06:05:00 0
# 3 2012-12-06 06:10:00 0
# 4 2012-12-06 06:15:00 0
# 5 2012-12-06 06:20:00 2
# 6 2012-12-06 06:25:00 0
# 7 2012-12-06 06:30:00 0
# 8 2012-12-06 06:35:00 0
# 9 2012-12-06 06:40:00 3
# 15 2012-12-06 07:10:00 1
# 16 2012-12-06 07:15:00 2
# 17 2012-12-06 07:20:00 5
# 18 2012-12-06 07:25:00 6
# 23 2012-12-06 07:50:00 6
# 24 2012-12-06 07:55:00 10
# 25 2012-12-06 08:00:00 2
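The run-length approach above only works if every row really is 5 minutes after the previous one. A minimal sketch of how that assumption could be checked first; the names `gaps` and `regular` are made up for illustration, and the timestamps are rebuilt with `seq()` only to keep the example self-contained:

```r
# Rebuild 25 timestamps at 5-minute spacing (stand-in for df1$a)
a <- seq(as.POSIXct("2012-12-06 06:00", tz = "UTC"),
         by = "5 min", length.out = 25)

# Gap between each row and the previous one, in minutes
gaps <- as.numeric(difftime(a[-1], a[-length(a)], units = "mins"))

# TRUE only if every gap is 5 minutes (small tolerance for floating point)
regular <- all(abs(gaps - 5) < 1e-8)
regular
```

If `regular` is FALSE, a run of four zeros no longer necessarily represents 20 minutes, so the row count alone should not be trusted.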
Upvotes: 2
Reputation: 89097
Another one, still using rle:
is.zero <- df1$b == 0
is.zero.rle <- rle(is.zero)
df1[rep(is.zero.rle$lengths, is.zero.rle$lengths) * is.zero < 4, ]
It might help to understand this if I show the intermediate result, which gives each zero row the length of the run it belongs to (and 0 for non-zero rows):
rep(is.zero.rle$lengths, is.zero.rle$lengths) * is.zero
# [1] 0 3 3 3 0 3 3 3 0 5 5 5 5 5 0 0 0 0 4 4 4 4 0 0 0
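The same idea could be wrapped in a small function so the threshold is not hard-coded. This is only a sketch: `drop_zero_runs`, `min_run` and the `id` column are hypothetical names, not part of the question, and it assumes the column has no missing values.

```r
# Hypothetical helper: drop rows that belong to a run of at least
# `min_run` consecutive zeros in `x` (assumes no NA in `x`)
drop_zero_runs <- function(df, x, min_run = 4) {
  is.zero <- x == 0
  r <- rle(is.zero)
  # length of the run each row belongs to
  run.len <- rep(r$lengths, r$lengths)
  df[!(is.zero & run.len >= min_run), , drop = FALSE]
}

# Same b values as in the question, with a row id for easy checking
b <- c("1","0","0","0","2","0","0","0","3","0","0","0","0","0",
       "1","2","5","6","0","0","0","0","6","10","2")
df <- data.frame(id = seq_along(b), b = b)

kept <- drop_zero_runs(df, df$b)
kept$id
# [1]  1  2  3  4  5  6  7  8  9 15 16 17 18 23 24 25
```

With 5-minute sampling, `min_run = 4` corresponds to the 20-minute rule in the question; a different sampling interval would need a different value.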
Upvotes: 3