JenB
JenB

Reputation: 17678

using merge to create blank rows

I am running multiple simulations with the same input parameters. Some simulations complete earlier than others and I need to extend the results of the shorter simulations so that I can analyse the data with all runs included. This means filling up 'short' runs with repeats of the final values until they are the same length as the 'long' runs with the same input parameters.

I would like a dplyr solution because the real datasets are massive and dplyr has fast joins.

Here is my attempt.

library(dplyr)
sims <- data.frame("run" = c(1, 1, 1, 2, 2, 3, 3),
                   "type" = c("A", "A", "A", "A", "A", "B", "B"),
                   "step" = c(0, 1, 2, 0, 1, 0, 1),
                   "value" = seq(1:7))
allSteps <- data.frame("type" = c("A", "A", "A", "B", "B"),
                       "step" = c(0, 1, 2, 0, 1))

merged <- full_join(sims, allSteps,
                    by = c("type", "step"))

This gets the output:

 run type step value
   1    A    0     1
   1    A    1     2
   1    A    2     3
   2    A    0     4
   2    A    1     5
   3    B    0     6
   3    B    1     7

But I actually want the following because run 2 is of type A and should therefore be expanded to the same length as run 1 (also type A):

 run type step value
   1    A    0     1
   1    A    1     2
   1    A    2     3
   2    A    0     4
   2    A    1     5
   2    A    2     NA   # extra line here
   3    B    0     6
   3    B    1     7

I will then use fill to get to my desired result of:

 run type step value
   1    A    0     1
   1    A    1     2
   1    A    2     3
   2    A    0     4
   2    A    1     5
   2    A    2     5    # filled replacement of NA
   3    B    0     6
   3    B    1     7

I am sure this is a duplicate of some question but the various search terms I used didn't manage to surface it.

Upvotes: 3

Views: 76

Answers (3)

moodymudskipper
moodymudskipper

Reputation: 47320

We can split the data frame by run and do a right_join on allSteps for each of them to have all the combinations you desire. Then we join back and fill.

It's a bit more general than current solutions in that you could have steps in allSteps that may not be in sims or in the sims subset you're working on.

library(tidyverse)

sims %>%
  split(.$run) %>%
  map_dfr(right_join,allSteps,.id = "id") %>%
  group_by(type,id) %>%
  fill(run,value,.direction="down") %>%
  ungroup %>%
  filter(!is.na(run)) %>%
  select(-id)

# # A tibble: 8 x 4
#     run   type  step value
#   <dbl> <fctr> <dbl> <int>
# 1     1      A     0     1
# 2     1      A     1     2
# 3     1      A     2     3
# 4     2      A     0     4
# 5     2      A     1     5
# 6     2      A     2     5
# 7     3      B     0     6
# 8     3      B     1     7

Upvotes: 0

zx8754
zx8754

Reputation: 56169

Using tidyr::complete to get missing combinations, then use fill to fill NAs with last non-NA value:

library(tidyr)

sims %>% 
  group_by(type) %>% 
  complete(run, step) %>% 
  select(run, type, step, value) %>% 
  ungroup() %>% 
  fill(value)

# # A tibble: 8 x 4
#     run type   step value
#   <dbl> <fct> <dbl> <int>
# 1  1.00 A      0        1
# 2  1.00 A      1.00     2
# 3  1.00 A      2.00     3
# 4  2.00 A      0        4
# 5  2.00 A      1.00     5
# 6  2.00 A      2.00     5
# 7  3.00 B      0        6
# 8  3.00 B      1.00     7

Upvotes: 1

mtoto
mtoto

Reputation: 24188

We don't really need the data.frame allSteps if at least one of the runs contains the full sequence for each type. Instead we can use tidyr::expand() in combination with a self-join:

library(tidyr)
sims %>% group_by(type) %>%
  expand(run, step) %>%
  full_join(sims, by = c("type", "step", "run")) %>%
  select(2,1,3,4)
#    run   type  step value
#  <dbl> <fctr> <dbl> <int>
#1     1      A     0     1
#2     1      A     1     2
#3     1      A     2     3
#4     2      A     0     4
#5     2      A     1     5
#6     2      A     2    NA
#7     3      B     0     6
#8     3      B     1     7

Upvotes: 2

Related Questions