Reputation: 4243
I have a dataset that contains weekly data. The week starts on a Monday and ends on a Sunday. This dataset is also broken out by group.
I want to detect if there are any missing consecutive dates between the start and finish for each group. Here is an example dataset:
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-05-04', '2015-06-29', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A', 'A','B','B','B','B')
Value<- c(2,3,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
df
Week Group Value
2015-04-13 A 2
2015-04-20 A 3
2015-05-04 A 10
2015-06-29 B 4
2015-07-06 B 11
2015-07-27 B 9
2015-08-03 B 8
For group B, there is missing data between 2015-07-06
and 2015-07-27
. There is also missing data in group A between 2015-04-20
and 2015-05-04
. I want to add a row for that group and have the value be NA
. I have many groups and I want my expected output to be below:
Week Group Value
2015-04-13 A 2
2015-04-20 A 3
2015-04-27 A NA
2015-05-04 A 10
2015-06-29 B 4
2015-07-06 B 11
2015-07-13 B NA
2015-07-20 B NA
2015-07-27 B 9
2015-08-03 B 8
Any help would be great, thanks!
Upvotes: 0
Views: 322
Reputation: 1210
This can be achieved using seq
function. Here is the code snippet.
Code:
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04', '2015-06-29','2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A','A', 'A','B','B','B','B')
Value<- c(2,3,2,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
#generate all the missing dates
alldates = seq(min(df$Week[df$Group == 'B']), max(df$Week[df$Group == 'B']), 7)
#filter out the dates that are not present in your dataset
dates = alldates[!(alldates %in% df$Week)]
#add these new dates to a new dataframe and rbind with the old dataframe
new_df = data.frame(Week = dates,Group = 'B', Value = NA)
df = rbind(df, new_df)
df = df[order(df$Week),]
Output:
Week Group Value
1 2015-04-13 A 2
2 2015-04-20 A 3
3 2015-04-27 A 2
4 2015-05-04 A 10
5 2015-06-29 B 4
6 2015-07-06 B 11
9 2015-07-13 B NA
10 2015-07-20 B NA
7 2015-07-27 B 9
8 2015-08-03 B 8
Upvotes: 1
Reputation: 51582
You can use complete
from tidyr
package, i.e.
library(tidyverse)
df %>%
group_by(Group) %>%
complete(Week = seq(min(Week), max(Week), by = 'week'))
which gives,
# A tibble: 10 x 3 # Groups: Group [2] Group Week Value <fct> <date> <dbl> 1 A 2015-04-13 2 2 A 2015-04-20 3 3 A 2015-04-27 NA 4 A 2015-05-04 10 5 B 2015-06-29 4 6 B 2015-07-06 NA 7 B 2015-07-13 NA 8 B 2015-07-20 NA 9 B 2015-07-27 11 10 B 2015-08-03 9
Upvotes: 3
Reputation: 816
The only way I've found to do this is using an inequality join in SQL.
library(tidyverse)
library(sqldf)
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04',
'2015-06-29', '2015-06-07', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A','A', 'A','B','B','B','B')
Value<- c(2,3,2,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
#what are the start and end weeks for each group?
GroupWeeks <- df %>%
group_by(Group) %>%
summarise(start = min(Week),
end = max(Week))
#What are all the possible weeks?
AllWeeks <- data.frame(Week = seq.Date(min(df$Week), max(df$Week), by = "week"))
#use an inequality join to add rows for every week within the group's range
sqldf("Select AllWeeks.Week, GroupWeeks.[Group], Value
From AllWeeks inner join GroupWeeks on AllWeeks.Week >= start AND AllWeeks.Week <= end
left join df on AllWeeks.Week = df.Week and GroupWeeks.[Group] = df.[Group]")
Upvotes: 1