Reputation: 499
I have a data frame that has missing rows. You can identify the missing rows by looking at the gaps in the sequence.
Count<-c(1,1,1,1,2,2,2,3,3,4,4,4,4,5,5,6,6,6)
Seq<-c(1,2,3,4,1,2,4,1,4,1,2,3,5,1,3,1,2,3)
MyData<-c(5,4,5,3,4,3,2,1,2,1,3,2,4,2,3,1,4,3)
DF1<-data.frame(Count,Seq,MyData)
DF1
The count keeps track of the sequence number, and the sequences will always run as a numeric sequence. In this case it is 1:5, but this can vary so I don't want to hard code this limitation.
My goal is to create two new data frame that contains all of the missing sequence rows. The first would have NA in the data column for the added "missing' rows.
Count2<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6)
Seq2<-c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
MyData2<-c(5,4,5,3,NA,4,3,NA,2,NA,1,NA,NA,2,NA,1,3,2,NA,4,2,NA,3,NA,NA,1,4,3,NA,NA)
DF2<-data.frame(Count2,Seq2,MyData2)
DF2
The second data frame would be similar, but would contain the last known data point for that Sequence number.
Count2<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6)
Seq2<-c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
MyData3<-c(5,4,5,3,NA,4,3,5,2,NA,1,3,5,2,NA,1,3,2,2,4,2,3,3,2,4,1,4,3,2,4)
DF3<-data.frame(Count2,Seq2,MyData3)
DF3
Note- the NA remains for the 5th element in the sequence for this first 3 counts as there was no initial value.
Upvotes: 1
Views: 252
Reputation: 50668
Here is a solution using
merge
on a "skeleton" dataframe
for the first question, and zoo::na.locf
to replace NA
s with last known values for the second question.As per OP's request, the maximum Seq
and Count
values are inferred from the data.
# These are the maximum seq and count numbers from the data
maxSeq <- max(DF1$Seq);
maxCts <- max(DF1$Count);
# Replicating DF1
# Construct "skeleton" dataframe with appropriate Seq and Count sequences
df.one <- data.frame(
Count = rep(seq(1:maxCts), each = maxSeq),
Seq = rep(seq(1:maxSeq), maxCts)
);
# Merge with source data, and put NAs for missing entries
df.one <- merge(df.one, DF1, all = TRUE);
tail(df.one)
# Count Seq MyData
#25 5 5 NA
#26 6 1 1
#27 6 2 4
#28 6 3 3
#29 6 4 NA
#30 6 5 NA
# Replicating DF2
# Split on Seq, replace NAs in MyData with last known value,
# and rbind into dataframe
df.two <- do.call(rbind.data.frame, lapply(split(df.two, df$Seq), function(x) {
x$MyData <- na.locf(x$MyData);
return(x);
}))
# Sort by Count then Seq
df.two <- df.two[order(df.two$Count, df.two$Seq), ];
rownames(df.two) <- NULL;
tail(df.two);
# Count Seq MyData
#25 5 5 3
#26 6 1 1
#27 6 2 4
#28 6 3 3
#29 6 4 3
#30 6 5 2
Count<-c(1,1,1,1,2,2,2,3,3,4,4,4,4,5,5,6,6,6)
Seq<-c(1,2,3,4,1,2,4,1,4,1,2,3,5,1,3,1,2,3)
MyData<-c(5,4,5,3,4,3,2,1,2,1,3,2,4,2,3,1,4,3)
DF1<-data.frame(Count,Seq,MyData)
Upvotes: 1
Reputation: 39154
A solution using dplyr
and tidyr
.
library(dplyr)
library(tidyr)
DF2 <- DF1 %>%
complete(Count, Seq = full_seq(Seq, period = 1)) %>%
arrange(Count, Seq)
DF3 <- DF2 %>%
arrange(Seq, Count) %>%
group_by(Seq) %>%
fill(MyData) %>%
arrange(Count) %>%
ungroup()
DF2
# # A tibble: 30 x 3
# Count Seq MyData
# <dbl> <dbl> <dbl>
# 1 1 1 5
# 2 1 2 4
# 3 1 3 5
# 4 1 4 3
# 5 1 5 NA
# 6 2 1 4
# 7 2 2 3
# 8 2 3 NA
# 9 2 4 2
# 10 2 5 NA
# # ... with 20 more rows
DF3
# # A tibble: 30 x 3
# Count Seq MyData
# <dbl> <dbl> <dbl>
# 1 1 1 5
# 2 1 2 4
# 3 1 3 5
# 4 1 4 3
# 5 1 5 NA
# 6 2 1 4
# 7 2 2 3
# 8 2 3 5
# 9 2 4 2
# 10 2 5 NA
# # ... with 20 more rows
Upvotes: 2