Reputation: 572
I'm trying to make the following vectorized manipulation of date column in my data. I found a very inelegant solution and am sure there is a cleaner tidy solution. Toy example:
index <- c(1,2)
input <- c('11-9-2019', '11/01/2019-01/31/2020')
output <- c('11-9-2019', '11-01-2019')
df_in <- data.frame('index'=index, 'data'=input)
df_out <- data.frame('index' =index, 'data'=output)
I can solve the problem using sapply as follows:
df_out$data <- sapply(range(1:2), function(x) ifelse(str_length(df_in$data[x]) > 12,
str_sub(df_in$data[x], -10, -1),
df_in$data[x]))
df_out$data <- str_replace_all(df_out$data, '/', '-')
df_out$data
Is there any way to do this a) with one vectorized line, b) without relying on string indices like I did in str_sub?
Thanks!
Upvotes: 1
Views: 74
Reputation: 886998
Another option in tidyverse
, would be to split the elements with separate_rows
and then convert to Date
class with lubridate
library(lubridate)
library(dplyr)
library(tidyr)
df_in %>%
separate_rows(data, sep="-(?=[0-9]{2}[^0-9])") %>%
group_by(index) %>%
slice(1) %>%
transmute(data = lubridate::mdy(data)) %>%
pull(data)
#[1] "2019-11-09" "2019-11-01"
Upvotes: 1
Reputation: 51582
An idea is to use mdy
(month day year) from lubridate
after you remove any excess dates, i.e.
lubridate::mdy(ifelse(nchar(df_in$data > 10), substr(df_in$data, 1, 10), df_in$data))
#[1] "2019-11-09" "2019-11-01"
Upvotes: 0
Reputation: 7592
You can do it using gsub
:
gsub("(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4}).*","\\1-\\2-\\3",df_in$data)
[1] "11-9-2019" "11-01-2019"
Explanation if you're not familiar with regex:
This searches for a string that has one or two digits ((\\d{1,2})
), followed by a slash or a dash ([/-]
), then one or two more digits, again a dash or a slash, and then four digits. It replaces these with just the three sets of digits separated by dashes, and removing anything that follows this first string.
Upvotes: 3