ADF
ADF

Reputation: 572

Vectorized conditional string manipulation

I'm trying to make the following vectorized manipulation of date column in my data. I found a very inelegant solution and am sure there is a cleaner tidy solution. Toy example:

index <- c(1,2)
input <- c('11-9-2019', '11/01/2019-01/31/2020')
output <- c('11-9-2019', '11-01-2019')

df_in <- data.frame('index'=index, 'data'=input)

df_out <- data.frame('index' =index, 'data'=output)

I can solve the problem using sapply as follows:

df_out$data <- sapply(range(1:2), function(x) ifelse(str_length(df_in$data[x]) > 12, 
                                          str_sub(df_in$data[x], -10, -1), 
                                                  df_in$data[x]))
df_out$data <- str_replace_all(df_out$data, '/', '-')
df_out$data

Is there any way to do this a) with one vectorized line, b) without relying on string indices like I did in str_sub?

Thanks!

Upvotes: 1

Views: 74

Answers (3)

akrun
akrun

Reputation: 886998

Another option in tidyverse, would be to split the elements with separate_rows and then convert to Date class with lubridate

library(lubridate)
library(dplyr)
library(tidyr)
df_in %>% 
   separate_rows(data, sep="-(?=[0-9]{2}[^0-9])") %>%
   group_by(index) %>%
   slice(1) %>% 
   transmute(data = lubridate::mdy(data)) %>%
   pull(data)
#[1] "2019-11-09" "2019-11-01"

Upvotes: 1

Sotos
Sotos

Reputation: 51582

An idea is to use mdy(month day year) from lubridate after you remove any excess dates, i.e.

lubridate::mdy(ifelse(nchar(df_in$data > 10), substr(df_in$data, 1, 10), df_in$data))
#[1] "2019-11-09" "2019-11-01"

Upvotes: 0

iod
iod

Reputation: 7592

You can do it using gsub:

 gsub("(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4}).*","\\1-\\2-\\3",df_in$data)
 [1] "11-9-2019"  "11-01-2019"

Explanation if you're not familiar with regex:

This searches for a string that has one or two digits ((\\d{1,2})), followed by a slash or a dash ([/-]), then one or two more digits, again a dash or a slash, and then four digits. It replaces these with just the three sets of digits separated by dashes, and removing anything that follows this first string.

Upvotes: 3

Related Questions