Reputation: 59
I have column in data.frame with dates in following string format (related to monthly, quarterly and annual data):
"2008Q1", "2008M1", "2008M2", "2008M3", "2008Q2", "2008M4", "2008M5",
"2008M6", "2008Q3", "2008M7", "2008M8", "2008M9", "2008Q4", "2008M10",
"2008M11", "2008M12", "2009", "2009Q1", "2009M1", "2009M2", "2009M3",
"2009Q2", "2009M4", "2009M5", "2009M6", "2009Q3", "2009M7", "2009M8",
"2009M9", "2009Q4", "2009M10", "2009M11", "2009M12", "2010"
Is there any elegant and fast solution (data.frame is really big) to convert it into two separate colums, containing frequency and the date, like this:
DFreq Date
Quarterly 1/3/2008
Monthly 1/1/2008
Monthly 1/2/2008
Monthly 1/3/2008
...
Monthly 1/12/2008
Annual 1/12/2009
Upvotes: 1
Views: 460
Reputation: 43334
The frequency can be extracted with a little regex, and the strings can be parsed to dates with anytime::anydate
(which inserts "01" for missing date components), but it parses all non-year numbers as months, so a little cleanup is necessary. In tidyverse grammar,
library(tidyverse)
library(lubridate)
df <- data_frame(date = c("2008Q1", "2008M1", "2008M2", "2008M3", "2008Q2", "2008M4", "2008M5",
"2008M6", "2008Q3", "2008M7", "2008M8", "2008M9", "2008Q4", "2008M10",
"2008M11", "2008M12", "2009", "2009Q1", "2009M1", "2009M2", "2009M3",
"2009Q2", "2009M4", "2009M5", "2009M6", "2009Q3", "2009M7", "2009M8",
"2009M9", "2009Q4", "2009M10", "2009M11", "2009M12", "2010"))
df %>%
mutate(frequency = recode(gsub('\\d', '', date), # remove all numbers...
'M' = 'Monthly', ...and recode as words
'Q' = 'Quarterly',
.default = 'Annually'),
date = anytime::anydate(date), # parse to year-month
date = {month(date) <- month(date) * recode(frequency, # ...and correct the month
'Annually' = 12,
'Quarterly' = 3,
.default = 1);
date})
#> # A tibble: 34 x 2
#> date frequency
#> <date> <chr>
#> 1 2008-03-01 Quarterly
#> 2 2008-01-01 Monthly
#> 3 2008-02-01 Monthly
#> 4 2008-03-01 Monthly
#> 5 2008-06-01 Quarterly
#> 6 2008-04-01 Monthly
#> 7 2008-05-01 Monthly
#> 8 2008-06-01 Monthly
#> 9 2008-09-01 Quarterly
#> 10 2008-07-01 Monthly
#> # ... with 24 more rows
This approach moves adjusts appropriately so as to change quarterly and annual data so that the dates line up with the first day of the last month of the period, as the desired result in the question does. Generally, it's actually much more useful to store the first day of the period, which you can obtain by leveraging the extreme versatility of lubridate::parse_date_time
to build a proper parser for the mixed format:
df %>%
mutate(frequency = recode(gsub('\\d', '', date),
'M' = 'Monthly',
'Q' = 'Quarterly',
.default = 'Annually'),
date = as_date(parse_date_time(
date,
c('Ym', 'Yq', 'Y'), # possible formats
select_formats = function(dates){ # function to determine format
recode(gsub('\\%.[a-z]?', '', names(dates)),
'M' = '%YM%m',
'Q' = '%YQ%q',
.default = '%Y')
})))
#> # A tibble: 34 x 2
#> date frequency
#> <date> <chr>
#> 1 2008-01-01 Quarterly
#> 2 2008-01-01 Monthly
#> 3 2008-02-01 Monthly
#> 4 2008-03-01 Monthly
#> 5 2008-04-01 Quarterly
#> 6 2008-04-01 Monthly
#> 7 2008-05-01 Monthly
#> 8 2008-06-01 Monthly
#> 9 2008-07-01 Quarterly
#> 10 2008-07-01 Monthly
#> # ... with 24 more rows
Upvotes: 3
Reputation: 17369
I won't speak to the efficiency of this, but it gets the job done.
library(stringr)
format_to_date <- function(x){
year <- str_extract(x, "^\\d{4}")
if (grepl("M", x)) {
month <- str_pad(str_extract(x, "\\d{1,2}$"), width = 2)
paste0(year, "-", month, "-01")
} else if (grepl("Q", x)) {
month <- as.numeric(str_extract(x, "\\d{1}$"))
month <- 1 + (month - 1) * 3
paste0(year, "-", month, "-01")
} else{
paste0(year, "-01-01")
}
}
Dframe <-
data.frame(string = c("2008M5", "2009Q3", "2011"),
stringsAsFactors = FALSE)
as.Date(vapply(Dframe$string, format_to_date, character(1)))
Because I was curious, I ran these all through microbenchmark
and came up with
Unit: milliseconds
expr min lq mean median uq max neval
benjamin 432.43466 433.31058 439.30987 439.20125 444.05267 448.95130 10
pogibas 665.64618 718.50771 734.78987 745.73741 747.14000 767.26852 10
alistaire 16.85593 17.13333 17.35033 17.31104 17.52041 17.92627 10
So I'd say go with @alistaire's answer.
Upvotes: 1
Reputation: 28339
Solution similar to Benjamin's (I'm using if/else
to grep Quarters or Months) and paste0
to get wanted format.
convertDate <- function(x) {
DFreq <- "Annual"
Date <- paste0("1/12/", x)
foo <- unlist(strsplit(x, "[A-Z]"))
if (length(grep("Q", x)) == 1) {
DFreq <- "Quarterly"
Date <- paste0("1/", as.numeric(foo[2]) * 3, "/", foo[1])
} else if (length(grep("M", x)) == 1) {
DFreq <- "Monthly"
Date <- paste0("1/", foo[2], "/", foo[1])
}
return(data.frame(DFreq, Date))
}
INPUT <- c("2008M5", "2009Q3", "2011")
res <- sapply(INPUT, convertDate, simplify = FALSE)
do.call("rbind", res)
DFreq Date
2008M5 Monthly 1/5/2008
2009Q3 Quarterly 1/9/2009
2011 Annual 1/12/2011
Upvotes: 1