LLL

Reputation: 743

troubleshooting dates with ifelse()

I've got a data frame with a date variable that has tens of thousands of entries. I think there may be a data entry mistake somewhere, because I can't convert it to a Date class variable or process it with lubridate.

In this MWE the first observation (a1) is a legitimate date in the format I'd expect my dates to be. The other observations (a2-a7) represent different kinds of data entry mistakes. I'd like to test each observation in the date variable, to see if the observation is a legitimate date in the expected format.

I've tried to use a regular expression and ifelse(), but I can't get the code to work. I'd like to end up with something like df2 (although it doesn't have to be a data frame), so that I can easily identify the IDs of any date observations that may require attention. Any help would be much appreciated.

Starting point:

df1 <- data.frame(varID=c("a1","a2","a3","a4","a5","a6","a7"),varDate=c("01/01/2015","0101/2016","01/012017","35/01/2018","01/17/2019","01/01/20200","abc"))

Desired outcome:

df2 <- data.frame(varID=c("a2","a3","a4","a5","a6","a7"),VarIssue=c("format issue","format issue","format issue","format issue","format issue","format issue"))

Current code:

ifelse(df1$varDate == (^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$),"ok","format issue")

Upvotes: 0

Views: 162

Answers (2)

Rui Barradas

Reputation: 76402

Maybe the following is off-topic, but if you are having problems with date formats, consider using the lubridate package. Its character-to-Date conversion functions recognize a large number of formats and don't give up at the first sign of trouble.

library(lubridate)

mdy(df1$varDate)
#[1] "2015-01-01" "2016-01-01" "2017-01-01" NA           "2019-01-17"
#[6] NA           NA          
#Warning message:
# 3 failed to parse.

As you can see, only 3 failed to parse; the others were coerced to class Date. You could then use a much simpler ifelse, but the result will differ from the desired output: lubridate accepts the missing separators in a2 and a3, and mdy() reads a5 as month/day/year.

df3 <- data.frame(varID = df1$varID)
df3$VarIssue <- ifelse(is.na(mdy(df1$varDate)), "format issue", "ok")
df3
#  varID     VarIssue
#1    a1           ok
#2    a2           ok
#3    a3           ok
#4    a4 format issue
#5    a5           ok
#6    a6 format issue
#7    a7 format issue

Only 3 "format issue".
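
If lenient parsing is not what you want here, a stricter check is possible in base R. The following is only a sketch, assuming the expected format is day/month/year with slashes (`%d/%m/%Y`); the names `x`, `parsed` and `ok` are illustrative. It parses with as.Date() and then formats the result back, so that an entry only counts as "ok" if it survives the round trip unchanged.

```r
# Same dates as df1$varDate in the question
x <- c("01/01/2015", "0101/2016", "01/012017", "35/01/2018",
       "01/17/2019", "01/01/20200", "abc")

# Strict parse: anything that does not fit %d/%m/%Y becomes NA
parsed <- as.Date(x, format = "%d/%m/%Y")

# Round trip: as.Date() ignores trailing characters, so also require
# that formatting the parsed date reproduces the original string
ok <- !is.na(parsed) & format(parsed, "%d/%m/%Y") == x

data.frame(varDate = x, VarIssue = ifelse(ok, "ok", "format issue"))
```

The round-trip comparison is the design choice that matters: without it, an entry with extra trailing characters could still parse to a valid date.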

Upvotes: 2

Mako212

Reputation: 7292

Two issues: you can't use a regex on its own (it has to be passed to a function that accepts a regex pattern), and you need to double-escape the backslashes.

In R you have to use the double escape like so: \\d, so your pattern becomes:

 pattern <- '^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\\d\\d$'
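
As a small illustration of why the backslash is doubled (a sketch, not part of the original code): inside an ordinary R string literal, "\\" stands for a single backslash, so "\\d" is the two-character regex \d. Assuming R 4.0.0 or later, a raw string is an alternative that avoids the doubling.

```r
# "\\d" is two characters: one backslash and the letter d
nchar("\\d")

# The regex engine therefore sees \d, i.e. "any digit"
grepl("\\d", c("a1", "abc"))

# R >= 4.0.0 only: a raw string holds the same two characters
identical("\\d", r"(\d)")
```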

Then we use grepl (which returns a logical vector) to check each row:

df1$check <- ifelse(grepl(pattern, df1$varDate), "ok", "format issue")

  varID     varDate        check
1    a1  01/01/2015           ok
2    a2   0101/2016 format issue
3    a3   01/012017 format issue
4    a4  35/01/2018 format issue
5    a5  01/17/2019 format issue
6    a6 01/01/20200 format issue
7    a7         abc format issue
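
To get from this check to something shaped like the df2 in the question, one option is to keep only the rows that fail the pattern. This is a minimal sketch using the question's own data; `bad` is an illustrative name. Note that the pattern only checks the shape of the string, so an impossible date such as 31/02/2019 would still pass.

```r
df1 <- data.frame(
  varID   = c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
  varDate = c("01/01/2015", "0101/2016", "01/012017", "35/01/2018",
              "01/17/2019", "01/01/20200", "abc"),
  stringsAsFactors = FALSE
)

pattern <- "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\\d\\d$"

# TRUE for every entry that does NOT match the expected format
bad <- !grepl(pattern, df1$varDate)

# Keep only the offending IDs, labelled as in the desired output
df2 <- data.frame(varID = df1$varID[bad],
                  VarIssue = "format issue",
                  stringsAsFactors = FALSE)
df2
```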

Upvotes: 2
