NaiveBayesian
NaiveBayesian

Reputation: 141

R Dplyr: Adding Missing Rows into Gaps in a Column of Integers

I am using the dplyr package in R to "summarise" some input data.

Step 1: Count the number of records (N), on a given date (Date), with a certain index value (Idx) in additional to another categorical value in column A.

T <- orig_data %>% group_by(A, Date, Idx) %>% summarise(N=n())

Step 2: Create a cumlative sum of the number of counts on a given date with a given Idx value or above.

T2 <- T %>% mutate(cN=cumsum(N))

However, some values of Idx between its maximum and minimum are not present in the table, e.g. Idx=10 is missing in the example below.

A   Date        Idx N   cN
N   2020-10-01  8   18  85
N   2020-10-01  9   6   91
N   2020-10-01  11  1   92
N   2020-10-01  13  10  102

I would like to add a row for each missing Idx value, on each day when it is missing, e.g.

N   2020-10-01  10  0   91  

I tried using the simplest possible version of the fill function to attempt this, but it produced no noticeable change in the results (maybe because Idx is being interpreted as a list and not a vector?)

T3 <- T2 %>% fill(cN)

Is there a way to easily add the above row, and identify any other missing rows and add them to the table?

Upvotes: 1

Views: 963

Answers (2)

www
www

Reputation: 39154

A solution using tidyr's functions: complete and nesting. I saw you commented in another post saying you have tried the complete function, but it is slow. Try nesting the columns to see if that helps.

library(dplyr)
library(tidyr)

dat2 <- dat %>%
  complete(nesting(A, Date), Idx = full_seq(Idx, period = 1), fill = list(N = 0)) %>%
  fill(cN)
dat2
# # A tibble: 6 x 5
#   A     Date         Idx     N    cN
#   <chr> <chr>      <dbl> <dbl> <int>
# 1 N     2020-10-01     8    18    85
# 2 N     2020-10-01     9     6    91
# 3 N     2020-10-01    10     0    91
# 4 N     2020-10-01    11     1    92
# 5 N     2020-10-01    12     0    92
# 6 N     2020-10-01    13    10   102

Upvotes: 1

Ian Cero
Ian Cero

Reputation: 370

The best solution I've found (I've run into the same problem before) is to create a stand-in dataframe, which includes the completed sequence I care about. I then join my real, main dataset into that stand-in.

It's a little hacky, but it can be accomplished in only a few lines. It also produces the results of interest.

library(dplyr)

df <- data.frame(idx = c(8, 9, 11, 12), n = c(11, 22, 33, 44)) %>% 
  left_join(
    x = data.frame(idx = seq(min(.$idx), max(.$idx))),
    y = .)

df

Output:

  idx  n
1   8 11
2   9 22
3  10 NA
4  11 33
5  12 44

Upvotes: 1

Related Questions