Maximilian
Maximilian

Reputation: 4229

Correct wrongly formatted dates

I have some incorrect dates between good formatted dates, looking something like this:

df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))

How can I convert the incorrect format between the existing correctly formatted dates?

I'm able to remove the first dashes, but also the it requires to remove the last 3 characters -01 or -1. So that the corrected values are:

desired <- c("1.1.11","1.1.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05"))

What I'm strangling with is the -01 part, since by removing these, would also remove part of the correct formatted dates.

EDIT: The format is mm.dd.yy

Upvotes: 0

Views: 70

Answers (4)

hwnd
hwnd

Reputation: 70732

Here is a pretty simple solution using sub ...

sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11"     "1.11.12"    "1.1.13"     "1.1.14"     "1.10.10"   
# [6] "1.10.11"    "1.10.12"    "2010-03-31" "2010-04-01" "2010-04-05"

Upvotes: 4

Avinash Raj
Avinash Raj

Reputation: 174776

Just remove all the non-word characters present at the start or -01 or -1 present at the end which was not preceded by -+ two digits.

> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=T)
 [1] "1.1.11"     "1.11.12"    "1.1.13"     "1.1.14"     "1.10.10"   
 [6] "1.10.11"    "1.10.12"    "2010-03-31" "2010-04-01" "2010-04-05"

Upvotes: 3

LauriK
LauriK

Reputation: 1929

A simple regexp will solve these kinds of problems pretty well:

> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
 [1] "--1.1.11-01"   "--1.11.12-1"   "--1.1.13-01"   "--1.1.14-01"   "--1.10.10-01"  "-1.10.11-01"   "---1.10.12-01"
 [8] "2010-03-31"    "2010-04-01"    "2010-04-05"   
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
 [1] "1.1.11"     "1.11.12"    "1.1.13"     "1.1.14"     "1.10.10"    "1.10.11"    "1.10.12"    "2010-03-31" "2010-04-01"
[10] "2010-04-05"

Note that I made it a character vector instead of data.frame.

The solution itself is just matching one pattern or the other pattern and then dropping the rest by replacing it with the subpattern.

Upvotes: 1

Chetan
Chetan

Reputation: 5095

I here observe that if the prefix of a date has an entry as -1 or --1 then only there exists a illegal suffix i.e -01. You could first take all the values in array. So you will have an array of "--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01"

Now you can check for the prefix if is it -1 or --1. if there exists any such thing then you can mark it as to remove the suffix -01 as well . According to the input pattern above I feel that the above strategy would work.

Please let me know if the strategy works

Upvotes: 0

Related Questions