Matt
Matt

Reputation: 333

Complete data frame with missing date ranges for multiple parameters

I have the following data frame:

Date_from <- c("2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13")
Date_to <- c("2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18")
y <- data.frame(Date_from,Date_to)
y$concentration <- c("1.5","2.5","1.5","3.5","1.5","2.5","1.5","3.5","3")
y$Parameter<-c("A","A","A","B","B","B","C","C","C")
y$Date_from <- as.Date(y$Date_from)
y$Date_to <- as.Date(y$Date_to)
y$concentration <- as.numeric(y$concentration)

I will need to check the data frame if for EACH Parameter the date range begins at the first day of the year (2013-01-01) and ends at the last day of the year (2013-12-31). If not I will need to add an extra row at the beginning and at the end for each of the parameters to complete the date range to a full year for each parameter. The result should look like this:

Date_from    Date_to concentration Parameter
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           1.5         A
2013-05-10 2013-08-12           2.5         A
2013-08-13 2013-11-18           1.5         A
2013-11-19 2013-12-31            NA        NA
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           3.5         B
2013-05-10 2013-08-12           1.5         B
2013-08-13 2013-11-18           2.5         B
2013-11-19 2013-12-31            NA        NA
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           1.5         C
2013-05-10 2013-08-12           3.5         C
2013-08-13 2013-11-18           3.0         C
2013-11-19 2013-12-31            NA        NA

Please note: The date ranges are only equal in this example for simplification.

UPDATE: This is my original data snippet and code:

sm<-read.csv("https://www.dropbox.com/s/tft6inwcrjqujgt/Test_data.csv?dl=1",sep=";",header=TRUE)
cleaned_sm<-sm[,c(4,5,11,14)] ##Delete obsolete columns
colnames(cleaned_sm)<-c("Parameter","Concentration","Date_from","Date_to")
cleaned_sm$Date_from<-as.Date(cleaned_sm$Date_from, format ="%d.%m.%Y")     
cleaned_sm$Date_to<-as.Date(cleaned_sm$Date_to, format ="%d.%m.%Y") 
#detect comma decimal separator and replace with dot decimal separater as comma is not recognised as a number
cleaned_sm=lapply(cleaned_sm, function(x) gsub(",", ".", x))
cleaned_sm<-data.frame(cleaned_sm)
cleaned_sm$Concentration <- as.numeric(cleaned_sm$Concentration)
cleaned_sm$Date_from <- as.Date(cleaned_sm$Date_from)
cleaned_sm$Date_to <- as.Date(cleaned_sm$Date_to)

Added code based on @jasbner:

cleaned_sm %>%
   group_by(Parameter) %>%
   do(add_row(.,
                 Date_from = ymd(max(Date_to))+1 ,
                 Date_to = ymd(paste(year(max(Date_to)),"1231")),
                 Parameter = .$Parameter[1])) %>%
   do(add_row(.,
                 Date_to = ymd(min(Date_from))-1, 
                 Date_from = ymd(paste(year(min(Date_from)),"0101")) ,
                 Parameter = .$Parameter[1],
                 .before = 0)) %>% 
   filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))

Upvotes: 0

Views: 94

Answers (2)

jasbner
jasbner

Reputation: 2283

My attempt with dplyr and lubridate. Hacked together but I think it should work. Note this does not look for any gaps in the middle of the date ranges. Basically, for each group, you add a row before and after that particular group. Then if there are any cases where the date range starts at the beginning of the year or ends at the end of the year the added rows are filtered out.

library(dplyr)
library(lubridate)
cleaned_sm %>%
  group_by(Parameter) %>%
  do(add_row(.,
             Date_from = ymd(max(.$Date_to))+1 ,
             Date_to = ymd(paste(year(max(.$Date_to)),"1231")),
             Parameter = .$Parameter[1])) %>%
  do(add_row(.,
             Date_to = ymd(min(.$Date_from))-1, 
             Date_from = ymd(paste(year(min(.$Date_from)),"0101")) ,
             Parameter = .$Parameter[1],
             .before = 0)) %>% 
  filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))  

# A tibble: 15 x 4
# Groups: Parameter [3]
#    Date_from  Date_to    concentration Parameter
#    <date>     <date>             <dbl> <chr>    
#  1 2013-01-01 2013-01-31         NA    A        
#  2 2013-02-01 2013-05-07          1.50 A        
#  3 2013-05-10 2013-08-12          2.50 A        
#  4 2013-08-13 2013-11-18          1.50 A        
#  5 2013-11-19 2013-12-31         NA    A        
#  6 2013-01-01 2013-01-31         NA    B        
#  7 2013-02-01 2013-05-07          3.50 B        
#  8 2013-05-10 2013-08-12          1.50 B        
#  9 2013-08-13 2013-11-18          2.50 B        
# 10 2013-11-19 2013-12-31         NA    B        
# 11 2013-01-01 2013-01-31         NA    C        
# 12 2013-02-01 2013-05-07          1.50 C        
# 13 2013-05-10 2013-08-12          3.50 C        
# 14 2013-08-13 2013-11-18          3.00 C        
# 15 2013-11-19 2013-12-31         NA    C 

Upvotes: 1

akash87
akash87

Reputation: 3994

This seems like it requires a combination of different packages to attack it. I am using tidyr, data.table, and I used lubridate.

date.start <- seq.Date(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")
Date.Int   <- data.frame(Date_from = date.start, Date_to = date.start)
y_wide     <- y %>% spread(Parameter, concentration)
y_wide     <- as.data.table(setkey(as.data.table(y_wide), Date_from, Date_to))
Date.Int    <- as.data.table(setkey(as.data.table(Date.Int), Date_from, Date_to))
dats <- foverlaps(Date.Int, y_wide, nomatch = NA)

fin.dat <- dats %>% 
mutate(A    = ifelse(is.na(A), -5, A), 
       seqs = cumsum(!is.na(A) & A != lag(A, default = -5))) %>% 
group_by(seqs) %>% 
summarise(Date_from = first(i.Date_from),                                                                                                    
          Date_to   = last(i.Date_to)   ,                                                                                                        
          A = first(A),                                                                                                        
          B = first(B),                                                                                                        
          C = first(C)) %>% 
          mutate(A = ifelse(A == -5, NA, A)) %>% 
          ungroup()%>% 
gather(Concentration, Parameter, A:C) %>% 
mutate(Concentration = ifelse(is.na(Parameter), NA, Concentration))

Okay, so I created a vector of dates from a start point to an end point (date.start); then I turned into a data.frame with the same interval names and interval dates for Date.Int. This is because foverlaps needs to compare two intervals (same date start and end dates in Date.Int are now officially intervals). I then took your data you provided and spread, turning it from long format data to wide format data and turned that into a data.table. keying a data.table sets up how it should be arranged, and when using foverlaps you have to key the start dates and end dates (in that order). foverlaps determines if an interval falls within another interval of dates. If you print out dats, you will see a bunch of lines with NA for everything because they did not fall within an interval. So now we have to group these in some manner. I picked grouping by values of "A" in dats. The grouping variable is called seqs. But then I summarised the data, and then switched it back from wide format to long format and replaced the appropriate NA values.

Upvotes: 0

Related Questions