Complete data frame with missing date ranges for multiple parameters

Question

I have the following data frame:

Date_from <- c("2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13")
Date_to <- c("2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18")
y <- data.frame(Date_from,Date_to)
y$concentration <- c("1.5","2.5","1.5","3.5","1.5","2.5","1.5","3.5","3")
y$Parameter<-c("A","A","A","B","B","B","C","C","C")
y$Date_from <- as.Date(y$Date_from)
y$Date_to <- as.Date(y$Date_to)
y$concentration <- as.numeric(y$concentration)

I will need to check the data frame if for EACH Parameter the date range begins at the first day of the year (2013-01-01) and ends at the last day of the year (2013-12-31). If not I will need to add an extra row at the beginning and at the end for each of the parameters to complete the date range to a full year for each parameter. The result should look like this:

Date_from    Date_to concentration Parameter
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           1.5         A
2013-05-10 2013-08-12           2.5         A
2013-08-13 2013-11-18           1.5         A
2013-11-19 2013-12-31            NA        NA
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           3.5         B
2013-05-10 2013-08-12           1.5         B
2013-08-13 2013-11-18           2.5         B
2013-11-19 2013-12-31            NA        NA
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           1.5         C
2013-05-10 2013-08-12           3.5         C
2013-08-13 2013-11-18           3.0         C
2013-11-19 2013-12-31            NA        NA

Please note: The date ranges are only equal in this example for simplification.

UPDATE: This is my original data snippet and code:

sm<-read.csv("https://www.dropbox.com/s/tft6inwcrjqujgt/Test_data.csv?dl=1",sep=";",header=TRUE)
cleaned_sm<-sm[,c(4,5,11,14)] ##Delete obsolete columns
colnames(cleaned_sm)<-c("Parameter","Concentration","Date_from","Date_to")
cleaned_sm$Date_from<-as.Date(cleaned_sm$Date_from, format ="%d.%m.%Y")     
cleaned_sm$Date_to<-as.Date(cleaned_sm$Date_to, format ="%d.%m.%Y") 
#detect comma decimal separator and replace with dot decimal separater as comma is not recognised as a number
cleaned_sm=lapply(cleaned_sm, function(x) gsub(",", ".", x))
cleaned_sm<-data.frame(cleaned_sm)
cleaned_sm$Concentration <- as.numeric(cleaned_sm$Concentration)
cleaned_sm$Date_from <- as.Date(cleaned_sm$Date_from)
cleaned_sm$Date_to <- as.Date(cleaned_sm$Date_to)

Added code based on @jasbner:

cleaned_sm %>%
   group_by(Parameter) %>%
   do(add_row(.,
                 Date_from = ymd(max(Date_to))+1 ,
                 Date_to = ymd(paste(year(max(Date_to)),"1231")),
                 Parameter = .$Parameter[1])) %>%
   do(add_row(.,
                 Date_to = ymd(min(Date_from))-1, 
                 Date_from = ymd(paste(year(min(Date_from)),"0101")) ,
                 Parameter = .$Parameter[1],
                 .before = 0)) %>% 
   filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))

jasbner · Accepted Answer

My attempt with dplyr and lubridate. Hacked together but I think it should work. Note this does not look for any gaps in the middle of the date ranges. Basically, for each group, you add a row before and after that particular group. Then if there are any cases where the date range starts at the beginning of the year or ends at the end of the year the added rows are filtered out.

library(dplyr)
library(lubridate)
cleaned_sm %>%
  group_by(Parameter) %>%
  do(add_row(.,
             Date_from = ymd(max(.$Date_to))+1 ,
             Date_to = ymd(paste(year(max(.$Date_to)),"1231")),
             Parameter = .$Parameter[1])) %>%
  do(add_row(.,
             Date_to = ymd(min(.$Date_from))-1, 
             Date_from = ymd(paste(year(min(.$Date_from)),"0101")) ,
             Parameter = .$Parameter[1],
             .before = 0)) %>% 
  filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))  

# A tibble: 15 x 4
# Groups: Parameter [3]
#    Date_from  Date_to    concentration Parameter
#                           
#  1 2013-01-01 2013-01-31         NA    A        
#  2 2013-02-01 2013-05-07          1.50 A        
#  3 2013-05-10 2013-08-12          2.50 A        
#  4 2013-08-13 2013-11-18          1.50 A        
#  5 2013-11-19 2013-12-31         NA    A        
#  6 2013-01-01 2013-01-31         NA    B        
#  7 2013-02-01 2013-05-07          3.50 B        
#  8 2013-05-10 2013-08-12          1.50 B        
#  9 2013-08-13 2013-11-18          2.50 B        
# 10 2013-11-19 2013-12-31         NA    B        
# 11 2013-01-01 2013-01-31         NA    C        
# 12 2013-02-01 2013-05-07          1.50 C        
# 13 2013-05-10 2013-08-12          3.50 C        
# 14 2013-08-13 2013-11-18          3.00 C        
# 15 2013-11-19 2013-12-31         NA    C

Complete data frame with missing date ranges for multiple parameters

Answers (2)

Related Questions