Alex
Alex

Reputation: 59

Converting string to date in R

I have column in data.frame with dates in following string format (related to monthly, quarterly and annual data):

"2008Q1", "2008M1", "2008M2", "2008M3", "2008Q2", "2008M4", "2008M5", 
"2008M6", "2008Q3", "2008M7", "2008M8", "2008M9", "2008Q4", "2008M10", 
"2008M11", "2008M12", "2009", "2009Q1", "2009M1", "2009M2", "2009M3", 
"2009Q2", "2009M4", "2009M5", "2009M6", "2009Q3", "2009M7", "2009M8", 
"2009M9", "2009Q4", "2009M10", "2009M11", "2009M12", "2010"

Is there any elegant and fast solution (data.frame is really big) to convert it into two separate colums, containing frequency and the date, like this:

DFreq       Date
Quarterly   1/3/2008
Monthly     1/1/2008
Monthly     1/2/2008
Monthly     1/3/2008
...
Monthly     1/12/2008
Annual      1/12/2009

Upvotes: 1

Views: 460

Answers (3)

alistaire
alistaire

Reputation: 43334

The frequency can be extracted with a little regex, and the strings can be parsed to dates with anytime::anydate (which inserts "01" for missing date components), but it parses all non-year numbers as months, so a little cleanup is necessary. In tidyverse grammar,

library(tidyverse)
library(lubridate)

df <- data_frame(date = c("2008Q1", "2008M1", "2008M2", "2008M3", "2008Q2", "2008M4", "2008M5", 
                          "2008M6", "2008Q3", "2008M7", "2008M8", "2008M9", "2008Q4", "2008M10", 
                          "2008M11", "2008M12", "2009", "2009Q1", "2009M1", "2009M2", "2009M3", 
                          "2009Q2", "2009M4", "2009M5", "2009M6", "2009Q3", "2009M7", "2009M8", 
                          "2009M9", "2009Q4", "2009M10", "2009M11", "2009M12", "2010"))

df %>% 
    mutate(frequency = recode(gsub('\\d', '', date),    # remove all numbers...
                              'M' = 'Monthly',    ...and recode as words
                              'Q' = 'Quarterly', 
                              .default = 'Annually'),
           date = anytime::anydate(date),    # parse to year-month
           date = {month(date) <- month(date) * recode(frequency,    # ...and correct the month
                                                       'Annually' = 12, 
                                                       'Quarterly' = 3, 
                                                       .default = 1); 
                   date})
#> # A tibble: 34 x 2
#>          date frequency
#>        <date>     <chr>
#>  1 2008-03-01 Quarterly
#>  2 2008-01-01   Monthly
#>  3 2008-02-01   Monthly
#>  4 2008-03-01   Monthly
#>  5 2008-06-01 Quarterly
#>  6 2008-04-01   Monthly
#>  7 2008-05-01   Monthly
#>  8 2008-06-01   Monthly
#>  9 2008-09-01 Quarterly
#> 10 2008-07-01   Monthly
#> # ... with 24 more rows

This approach moves adjusts appropriately so as to change quarterly and annual data so that the dates line up with the first day of the last month of the period, as the desired result in the question does. Generally, it's actually much more useful to store the first day of the period, which you can obtain by leveraging the extreme versatility of lubridate::parse_date_time to build a proper parser for the mixed format:

df %>% 
    mutate(frequency = recode(gsub('\\d', '', date),
                              'M' = 'Monthly', 
                              'Q' = 'Quarterly', 
                              .default = 'Annually'),
           date = as_date(parse_date_time(
               date, 
               c('Ym', 'Yq', 'Y'),    # possible formats
               select_formats = function(dates){    # function to determine format
                   recode(gsub('\\%.[a-z]?', '', names(dates)), 
                          'M' = '%YM%m', 
                          'Q' = '%YQ%q', 
                          .default = '%Y')
               })))
#> # A tibble: 34 x 2
#>          date frequency
#>        <date>     <chr>
#>  1 2008-01-01 Quarterly
#>  2 2008-01-01   Monthly
#>  3 2008-02-01   Monthly
#>  4 2008-03-01   Monthly
#>  5 2008-04-01 Quarterly
#>  6 2008-04-01   Monthly
#>  7 2008-05-01   Monthly
#>  8 2008-06-01   Monthly
#>  9 2008-07-01 Quarterly
#> 10 2008-07-01   Monthly
#> # ... with 24 more rows

Upvotes: 3

Benjamin
Benjamin

Reputation: 17369

I won't speak to the efficiency of this, but it gets the job done.

library(stringr)

format_to_date <- function(x){
  year <- str_extract(x, "^\\d{4}")
  if (grepl("M", x)) {
    month <- str_pad(str_extract(x, "\\d{1,2}$"), width = 2)
    paste0(year, "-", month, "-01")
  } else if (grepl("Q", x)) {
    month <- as.numeric(str_extract(x, "\\d{1}$"))
    month <- 1 + (month - 1) * 3
    paste0(year, "-", month, "-01")
  } else{
    paste0(year, "-01-01")
  }
}

Dframe <- 
  data.frame(string = c("2008M5", "2009Q3", "2011"),
             stringsAsFactors = FALSE)

as.Date(vapply(Dframe$string, format_to_date, character(1)))

Efficiency

Because I was curious, I ran these all through microbenchmark and came up with

Unit: milliseconds
      expr       min        lq      mean    median        uq       max neval
  benjamin 432.43466 433.31058 439.30987 439.20125 444.05267 448.95130    10
   pogibas 665.64618 718.50771 734.78987 745.73741 747.14000 767.26852    10
 alistaire  16.85593  17.13333  17.35033  17.31104  17.52041  17.92627    10

So I'd say go with @alistaire's answer.

Upvotes: 1

pogibas
pogibas

Reputation: 28339

Solution similar to Benjamin's (I'm using if/else to grep Quarters or Months) and paste0 to get wanted format.

convertDate <- function(x) {
    DFreq <- "Annual"
    Date <- paste0("1/12/", x)
    foo <- unlist(strsplit(x, "[A-Z]"))
    if (length(grep("Q", x)) == 1) {
        DFreq <- "Quarterly"
        Date <- paste0("1/", as.numeric(foo[2]) * 3, "/", foo[1])
    } else if (length(grep("M", x)) == 1) {
        DFreq <- "Monthly"
        Date <- paste0("1/", foo[2], "/", foo[1])
    }
    return(data.frame(DFreq, Date))
}

INPUT <- c("2008M5", "2009Q3", "2011")
res <- sapply(INPUT, convertDate, simplify = FALSE)
do.call("rbind", res)

           DFreq      Date
2008M5   Monthly  1/5/2008
2009Q3 Quarterly  1/9/2009
2011      Annual 1/12/2011

Upvotes: 1

Related Questions