Reputation: 141
I am using the dplyr package in R to "summarise" some input data.
Step 1: Count the number of records (N), on a given date (Date), with a certain index value (Idx) in additional to another categorical value in column A.
T <- orig_data %>% group_by(A, Date, Idx) %>% summarise(N=n())
Step 2: Create a cumlative sum of the number of counts on a given date with a given Idx value or above.
T2 <- T %>% mutate(cN=cumsum(N))
However, some values of Idx between its maximum and minimum are not present in the table, e.g. Idx=10 is missing in the example below.
A Date Idx N cN
N 2020-10-01 8 18 85
N 2020-10-01 9 6 91
N 2020-10-01 11 1 92
N 2020-10-01 13 10 102
I would like to add a row for each missing Idx value, on each day when it is missing, e.g.
N 2020-10-01 10 0 91
I tried using the simplest possible version of the fill function to attempt this, but it produced no noticeable change in the results (maybe because Idx is being interpreted as a list and not a vector?)
T3 <- T2 %>% fill(cN)
Is there a way to easily add the above row, and identify any other missing rows and add them to the table?
Upvotes: 1
Views: 963
Reputation: 39154
A solution using tidyr
's functions: complete
and nesting
. I saw you commented in another post saying you have tried the complete
function, but it is slow. Try nesting
the columns to see if that helps.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
complete(nesting(A, Date), Idx = full_seq(Idx, period = 1), fill = list(N = 0)) %>%
fill(cN)
dat2
# # A tibble: 6 x 5
# A Date Idx N cN
# <chr> <chr> <dbl> <dbl> <int>
# 1 N 2020-10-01 8 18 85
# 2 N 2020-10-01 9 6 91
# 3 N 2020-10-01 10 0 91
# 4 N 2020-10-01 11 1 92
# 5 N 2020-10-01 12 0 92
# 6 N 2020-10-01 13 10 102
Upvotes: 1
Reputation: 370
The best solution I've found (I've run into the same problem before) is to create a stand-in dataframe, which includes the completed sequence I care about. I then join my real, main dataset into that stand-in.
It's a little hacky, but it can be accomplished in only a few lines. It also produces the results of interest.
library(dplyr)
df <- data.frame(idx = c(8, 9, 11, 12), n = c(11, 22, 33, 44)) %>%
left_join(
x = data.frame(idx = seq(min(.$idx), max(.$idx))),
y = .)
df
Output:
idx n
1 8 11
2 9 22
3 10 NA
4 11 33
5 12 44
Upvotes: 1