user113156
user113156

Reputation: 7107

multiple maps to mutate a data frame and add a column

I am working with some data which looks like:

# A tibble: 2 x 3
  splits          id        inner_resamples 
  <named list>    <chr>     <named list>    
1 <split [20/20]> Resample1 <tibble [6 x 2]>
2 <split [20/20]> Resample2 <tibble [6 x 2]>

What I want to do is to map over the inner_resamples column and map again over the splits column in each inner_resamples column. For each list I would like to map again.

The way to do this is to use the analysis function from the rsample package.

map(cv_rolling$inner_resamples$`1`$splits, ~ analysis(.x)) %>% tail()

What I would like to do is to map over each of the outputs and create new data 7 columns:

    > map(cv_rolling$inner_resamples$`1`$splits, ~ analysis(.x)) %>% tail()
[[1]]
# A tibble: 2 x 4
  time       ID    Value   out
  <date>     <chr> <dbl> <dbl>
1 2016-12-13 CAT1   796.     1
2 2016-12-14 CAT1   797.     0

[[2]]
# A tibble: 2 x 4
  time       ID    Value   out
  <date>     <chr> <dbl> <dbl>
1 2016-12-15 CAT1   798.     1
2 2016-12-16 CAT1   791.     0

[[3]]
# A tibble: 2 x 4
  time       ID    Value   out
  <date>     <chr> <dbl> <dbl>
1 2016-12-19 CAT1   794.     1
2 2016-12-20 CAT1   796.     0

[[4]]
# A tibble: 2 x 4
  time       ID    Value   out
  <date>     <chr> <dbl> <dbl>
1 2016-12-21 CAT1   795.     0
2 2016-12-22 CAT1   791.     0

[[5]]
# A tibble: 2 x 4
  time       ID    Value   out
  <date>     <chr> <dbl> <dbl>
1 2016-12-23 CAT1   790.     0
2 2016-12-27 CAT1   792.     1

[[6]]
# A tibble: 2 x 4
  time       ID    Value   out
  <date>     <chr> <dbl> <dbl>
1 2016-12-28 CAT1   785.     0
2 2016-12-29 CAT1   783.     0

Expected output would be (for 1 of the outputs)

[[6]]
# A tibble: 2 x 4
  time       ID    Value   out    NEWCOL
  <date>     <chr> <dbl> <dbl>    
1 2016-12-28 CAT1   785.     0    8677 
2 2016-12-29 CAT1   783.     0    8757

However I would like to also do this for each N in the data:

map(cv_rolling$inner_resamples$`N`$splits, ~ analysis(.x)) %>% tail()

Where N here can be accessed by:

cv_rolling$inner_resamples[[1]]
cv_rolling$inner_resamples[[2]]
cv_rolling$inner_resamples[[N]]

New Data:

structure(list(time = structure(c(17136, 17137, 17140, 17141, 
17142, 17143, 17144, 17147, 17148, 17149, 17150, 17151, 17154, 
17155, 17156, 17157, 17158, 17162, 17163, 17164, 17165, 17136, 
17137, 17140, 17141, 17142, 17143, 17144, 17147, 17148, 17149, 
17150, 17151, 17154, 17155, 17156, 17157, 17158, 17162, 17163, 
17164, 17165), class = "Date"), ID = c("CAT1", "CAT1", "CAT1", 
"CAT1", "CAT1", "CAT1", "CAT1", "CAT1", "CAT1", "CAT1", "CAT1", 
"CAT1", "CAT1", "CAT1", "CAT1", "CAT1", "CAT1", "CAT1", "CAT1", 
"CAT1", "CAT1", "CAT2", "CAT2", "CAT2", "CAT2", "CAT2", "CAT2", 
"CAT2", "CAT2", "CAT2", "CAT2", "CAT2", "CAT2", "CAT2", "CAT2", 
"CAT2", "CAT2", "CAT2", "CAT2", "CAT2", "CAT2", "CAT2"), Value = c(747.919983, 
750.5, 762.52002, 759.109985, 771.190002, 776.419983, 789.289978, 
789.27002, 796.099976, 797.070007, 797.849976, 790.799988, 794.200012, 
796.419983, 794.559998, 791.26001, 789.909973, 791.549988, 785.049988, 
782.789978, 771.820007, 56.283112, 56.330643, 57.252861, 56.996159, 
58.346195, 58.003925, 58.916634, 59.106773, 59.876858, 59.591648, 
59.496574, 59.230362, 60.485325, 60.409275, 60.409275, 60.418777, 
60.124058, 60.162071, 59.886375, 59.800812, 59.078251), out = c(0, 
1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0)), row.names = c(NA, 
-42L), index_quo = ~date, index_time_zone = "UTC", class = c("tbl_time", 
"tbl_df", "tbl", "data.frame"))

Also need to run:

library(rsample)
library(purrr)
library(tibbletime)

periods_train <- 2
periods_test  <- 1
skip_span     <- 1

cv_rolling <- nested_cv(df, 
                        outside = group_vfold_cv(group = "ID"),
                        inside = rolling_origin(
                          initial    = periods_train,
                          assess     = periods_test,
                          cumulative = FALSE,
                          skip       = skip_span))

Where the following can be run:

map(cv_rolling$inner_resamples$`2`$splits, ~ analysis(.x))

Which is what I am trying to modify / create new data from.

Upvotes: 2

Views: 183

Answers (1)

TimTeaFan
TimTeaFan

Reputation: 18541

I am not sure, what kind of function you want to apply to generate NEWCOL, but here is some toy example for your data which just divides the original Value column by 10:

cv_rolling %>% 
  mutate(data  = map(inner_resamples, "splits"),
         data2 = map_depth(data, 2, rsample::analysis),
         data3 = map_depth(data2, 2, ~ mutate(.x, NEWCOL = Value/10)))

If the mutate call is rather complex you could put it in a helper function.

mutate_helper <- function(df) {
  mutate(df, NEWCOL = Value/10)
}

cv_rolling %>% 
  mutate(data  = map(inner_resamples, "splits"),
         data2 = map_depth(data, 2, rsample::analysis),
         data3 = map_depth(data2, 2, mutate_helper))

Upvotes: 1

Related Questions