Sophie
Sophie

Reputation: 97

How to fill missing data (not NA value) with value 0?

My data are like:

 Date Value
00:00    10
01:00     8
02:00     1
04:00     4
...

Some data are missing if the value=0. My question is how to fill these data back in. Like, after 02:00 17, fill in a row 03:00 0.

I have done some search, but only found solutions to replace NAs with 0. In my case, my data are not even showing in the data frame. Is there a way to check whether there's a gap between adjacent data?

Upvotes: 2

Views: 225

Answers (2)

rawr
rawr

Reputation: 20811

Two simple ways I can think of in base r:

s <- format(seq(s <- as.POSIXct('2000-01-01'), s + 3.6e4, by = 'hour'), '%H:%M')
# [1] "00:00" "01:00" "02:00" "03:00" "04:00" "05:00" "06:00" "07:00" "08:00"
# [10] "09:00" "10:00"
ss <- s[c(1:3, 5)]
dd <- data.frame(hour = ss, value = c(10, 8, 1, 4), stringsAsFactors = FALSE)

#    hour value
# 1 00:00    10
# 2 01:00     8
# 3 02:00     1
# 4 04:00     4

I made two sample vectors, s are the times you want to fill in, and ss are the times in your data frame that you have. presumably you already have both of these, so then you can

  1. create a data frame with the sequence of times you want and merge the two with all = TRUE so that there are no duplicates; then replace the NA with 0

dm <- data.frame(hour = s)
out <- merge(dm, dd, all = TRUE)

#     hour value
# 1  00:00    10
# 2  01:00     8
# 3  02:00     1
# 4  03:00    NA
# 5  04:00     4
# 6  05:00    NA
# 7  06:00    NA
# 8  07:00    NA
# 9  08:00    NA
# 10 09:00    NA
# 11 10:00    NA

out[is.na(out)] <- 0

#     hour value
# 1  00:00    10
# 2  01:00     8
# 3  02:00     1
# 4  03:00     0
# 5  04:00     4
# 6  05:00     0
# 7  06:00     0
# 8  07:00     0
# 9  08:00     0
# 10 09:00     0
# 11 10:00     0
  1. or you can give the exact hours you want in a vector or take the set difference between the times you want and the times you have and order the results:

## giving the times explicitly
out <- rbind(dd, data.frame(hour = sprintf('%02s:00', c(3, 5:10)), value = 0))
## or more programmatically:
out <- rbind(dd, data.frame(hour = setdiff(s, dd$hour),
                            value = 0))
out[order(out$hour), ]

#     hour value
# 1  00:00    10
# 2  01:00     8
# 3  02:00     1
# 5  03:00     0
# 4  04:00     4
# 6  05:00     0
# 7  06:00     0
# 8  07:00     0
# 9  08:00     0
# 10 09:00     0
# 11 10:00     0

Upvotes: 2

Colonel Beauvel
Colonel Beauvel

Reputation: 31161

Here is an approach using data.table:

library(data.table)

data = data.frame(Date=as.Date(c('2015-03-20','2015-03-24','2015-03-25','2015-03-28')), 
                  Value=c(1,2,3,4)) 

#        Date Value
#1 2015-03-20     1
#2 2015-03-24     2
#3 2015-03-25     3
#4 2015-03-28     4

dt = data.table(Date=seq(min(data$Date), max(data$Date), by='days'))
setkey(setDT(data), Date)[dt][!data, Value:=0][]

#         Date Value
#1: 2015-03-20     1
#2: 2015-03-21     0
#3: 2015-03-22     0
#4: 2015-03-23     0
#5: 2015-03-24     2
#6: 2015-03-25     3
#7: 2015-03-26     0
#8: 2015-03-27     0
#9: 2015-03-28     4

It is basically a join on the resampling table - setkey(setDT(data), Date)[dt] - you want (you have to define it, here this is dt). Then you replace values not present in your original dataset with 0's - [!data, Value:=0]

Upvotes: 3

Related Questions