user5249203

Reputation: 4648

infer quarter column from month and month column from quarter

I have a list of data frames with the same column names. Some data frames have quarter information, others have month information, and some have both or neither. All of them have year information. I am trying to build a condition that derives the missing information, so that I end up with two new columns, QtrYr and Date.

library(dplyr)
df <- dplyr::tibble(
  m = c(1, 2, NA, NA, NA, NA, 7, NA, 9, NA, NA, 12, NA),
  q = c(NA, NA, 1, 2, 2, 2, NA, 3, 3, 4, 4, 4, NA),
  y = c(2016, 2016, 2016, 2017, 2017, 2017, 2018 , 2018 , 2018 , 2020, 2020, 2020, 2020)
)
print(df)
#> # A tibble: 13 x 3
#>        m     q     y
#>    <dbl> <dbl> <dbl>
#>  1     1    NA  2016
#>  2     2    NA  2016
#>  3    NA     1  2016
#>  4    NA     2  2017
#>  5    NA     2  2017
#>  6    NA     2  2017
#>  7     7    NA  2018
#>  8    NA     3  2018
#>  9     9     3  2018
#> 10    NA     4  2020
#> 11    NA     4  2020
#> 12    12     4  2020
#> 13    NA    NA  2020

lsdf <- list(df1 = df, df2 = df)

Desired output:

out_df <- dplyr::tibble(
  m = c(1, 2, NA, NA, NA, NA, 7, NA, 9, NA, NA, 12, NA),
  q = c(NA, NA, 1, 2, 2, 2, NA, 3, 3, 4, 4, 4, NA),
  y = c(2016, 2016, 2016, 2017, 2019, 2020, 2017, 2019, 2020, 2016, 2017, 2019, 2020),
  qy = c("Q1/2016", "Q1/2016", "Q1/2016", "Q2/2017", "Q2/2017", "Q2/2017", "Q3/2018", "Q3/2018", "Q3/2018", "Q4/2020", "Q4/2020", "Q4/2020", NA),
  dy = c("3/1/2016", "3/1/2016", "3/1/2016", "6/1/2017", "6/1/2017", "6/1/2017", "9/1/2018", "9/1/2018", "9/1/2018", "12/1/2020", "12/1/2020", "12/1/2020", NA)
)

print(out_df)
#> # A tibble: 13 x 5
#>        m     q     y qy      dy       
#>    <dbl> <dbl> <dbl> <chr>   <chr>    
#>  1     1    NA  2016 Q1/2016 3/1/2016 
#>  2     2    NA  2016 Q1/2016 3/1/2016 
#>  3    NA     1  2016 Q1/2016 3/1/2016 
#>  4    NA     2  2017 Q2/2017 6/1/2017 
#>  5    NA     2  2019 Q2/2017 6/1/2017 
#>  6    NA     2  2020 Q2/2017 6/1/2017 
#>  7     7    NA  2017 Q3/2018 9/1/2018 
#>  8    NA     3  2019 Q3/2018 9/1/2018 
#>  9     9     3  2020 Q3/2018 9/1/2018 
#> 10    NA     4  2016 Q4/2020 12/1/2020
#> 11    NA     4  2017 Q4/2020 12/1/2020
#> 12    12     4  2019 Q4/2020 12/1/2020
#> 13    NA    NA  2020 <NA>    <NA>

I tried to use case_when and thought it would be fairly straightforward, but it looks like either I am not passing the conditions as expected or I am heading in totally the wrong direction.

lsdf$df1 %>% dplyr::mutate(
  Qrt = dplyr::case_when(
   is.na(m) & is.na(q) ~ NA,
   is.na(m) & !is.na(q) ~ q,
   m != NULL & q == NA ~ paste0("Q",ceiling(as.numeric(m)/3)),
   m != NULL & q != NULL ~ paste0("Q", q)
))
#> Error: `m != NULL & q == NA ~ paste0("Q", ceiling(as.numeric(m)/3))`, `m != NULL & q != NULL ~ paste0("Q", q)` must be length 13 or one, not 0

Created on 2020-03-31 by the reprex package (v0.3.0)

I was thinking I could build a QtrYr column and then run this zoo function to get the date.

x <- c("Q1/13", "Q2/14")
as.Date(zoo::as.yearqtr(x, format = "Q%q/%y"))

Appreciate any help in solving this.

Upvotes: 2

Views: 93

Answers (2)

akrun

Reputation: 887153

case_when and if_else check types, so the outputs of all conditions need to be of the same type. Also, it is not clear why NULL should be checked on a vector (i.e. a column): NULL is dropped automatically when a vector is built, and it can only exist as an element of a list.

For example:

c(NA, NULL, 1:3)
#[1] NA  1  2  3

and

list(NULL, NULL, 1:3) 
#[[1]]
#NULL

#[[2]]
#NULL

#[[3]]
#[1] 1 2 3

In the second case, NULL will remain as such
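
To see where the "must be length 13 or one, not 0" error comes from, it may help to look at what these comparisons return in base R. This is only a small sketch; the vector m below is the question's column copied out for illustration.

```r
m <- c(1, 2, NA, NA, NA, NA, 7, NA, 9, NA, NA, 12, NA)

# Comparing against NULL yields a zero-length logical vector,
# which is the length-0 input that case_when() complained about.
length(m != NULL)
#> [1] 0

# Comparing against NA yields NA for every element, never TRUE.
all(is.na(m == NA))
#> [1] TRUE

# is.null() tests the whole object, so it is FALSE for any existing column.
is.null(m)
#> [1] FALSE

# is.na() is the element-wise test for missing values.
sum(is.na(m))
#> [1] 8
```

So for a column, is.na is the test that actually distinguishes the rows.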


Here, test for missing values with is.na; is.null checks the whole column at once and is always FALSE for an existing column, so it adds nothing. Also make sure every branch returns a single type: q is numeric, the paste0 branches return character, and a bare NA is logical, so build the label from q with paste0 and use NA_character_ for the missing case.

library(dplyr)
lsdf$df1 %>% dplyr::mutate(
  Qrt = dplyr::case_when(
    # neither month nor quarter is known
    is.na(m) & is.na(q) ~ NA_character_,
    # quarter is known: use it, whether or not month is present
    !is.na(q) ~ paste0("Q", q),
    # only month is known: derive the quarter from it
    !is.na(m) ~ paste0("Q", ceiling(as.numeric(m) / 3))
  ))

Also, since lsdf is a list, use map to loop over its elements:

library(purrr)
map(lsdf, ~ .x %>% dplyr::mutate(
  Qrt = dplyr::case_when(
    # neither month nor quarter is known
    is.na(m) & is.na(q) ~ NA_character_,
    # quarter is known: use it, whether or not month is present
    !is.na(q) ~ paste0("Q", q),
    # only month is known: derive the quarter from it
    !is.na(m) ~ paste0("Q", ceiling(as.numeric(m) / 3))
  )))

Update

If we need the 'qy' and 'dy' columns as in the updated question:

library(stringr)
library(zoo)
library(lubridate)
map(lsdf, ~
      .x %>%
        # take q where present, otherwise derive the quarter from the month
        mutate(qtr = coalesce(q, ceiling(m / 3)),
               # str_c() returns NA when qtr is NA (both m and q missing)
               qy = str_c("Q", qtr, "/", y),
               # last day of the quarter, floored to the first of that month
               dy = floor_date(as.Date(as.yearqtr(qy, "Q%q/%Y"), frac = 1), "month")) %>%
        select(-qtr))
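
The date step can be checked on its own. A small sketch, assuming zoo and lubridate are installed; the vector qy and the name dy_chr are made up for illustration and are not part of the question's data.

```r
library(zoo)
library(lubridate)

qy <- c("Q1/2016", "Q4/2020")

# frac = 1 asks for the last day of the quarter
end_of_qtr <- as.Date(as.yearqtr(qy, format = "Q%q/%Y"), frac = 1)
end_of_qtr
#> [1] "2016-03-31" "2020-12-31"

# floor_date() moves each date back to the first of its month
dy <- floor_date(end_of_qtr, "month")
dy
#> [1] "2016-03-01" "2020-12-01"

# month/day/year without zero padding, as in the desired output
dy_chr <- paste(month(dy), day(dy), year(dy), sep = "/")
dy_chr
#> [1] "3/1/2016"  "12/1/2020"
```

Keeping dy as a Date and converting to text only for display is probably the safer choice, since the character form no longer sorts chronologically.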

Upvotes: 1

bsuthersan

Reputation: 118

Is this what you were after?

lsdf$df1 %>% 
  mutate(Qrt = case_when(
    !is.na(q) ~ q,
    !is.na(m) & is.na(q) ~ ceiling(as.numeric(m)/3),
    is.na(m) & is.na(q) ~ NA_real_
  )) %>%
  mutate(x = ifelse(is.na(Qrt), NA, paste0(Qrt, "/", y))) %>%
  mutate(x = as.Date(zoo::as.yearqtr(x, format = "%q/%Y")))  # %Y: y holds four-digit years

I cleaned up your case_when a little bit. The issue was that you were trying to combine character and numeric outputs. I've changed the Qrt variable to be numeric. Hope this helps.
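
Keeping the quarter numeric until the last step also makes the QtrYr label easy to build. A rough sketch in base R, using a few made-up rows rather than the full data; the names qtr and qy are chosen here for illustration.

```r
m <- c(1, NA, 9, NA)
q <- c(NA, 2, 3, NA)
y <- c(2016, 2017, 2018, 2020)

# take q where present, otherwise derive the quarter from the month
qtr <- ifelse(is.na(q), ceiling(m / 3), q)
qtr
#> [1]  1  2  3 NA

# build the label only where a quarter is known
qy <- ifelse(is.na(qtr), NA_character_, paste0("Q", qtr, "/", y))
qy
#> [1] "Q1/2016" "Q2/2017" "Q3/2018" NA
```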

Upvotes: 1
