Reputation: 743
I've got a data frame with a date variable with tens of thousands of entries. I think there may be a data entry mistake somewhere, because I can't convert it to a Date class variable or process it with lubridate.
In this MWE the first observation (a1) is a legitimate date in the format I'd expect my dates to be. The other observations (a2-a7) represent different kinds of data entry mistakes. I'd like to test each observation in the date variable, to see if the observation is a legitimate date in the expected format.
I've tried to use regular expressions and ifelse(), but I can't get the code to work. I'd like to end up with something like df2 (although it doesn't have to be a data frame), so that I can easily identify the IDs of any date variable observations that may require attention. Any help would be much appreciated.
Starting point:
df1 <- data.frame(varID=c("a1","a2","a3","a4","a5","a6","a7"),varDate=c("01/01/2015","0101/2016","01/012017","35/01/2018","01/17/2019","01/01/20200","abc"))
Desired outcome:
df2 <- data.frame(varID=c("a2","a3","a4","a5","a6","a7"),VarIssue=c("format issue","format issue","format issue","format issue","format issue","format issue"))
Current code:
ifelse(df1$varDate == (^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$),"ok","format issue")
Upvotes: 0
Views: 162
Reputation: 76402
Maybe the following is off-topic, but if you are having problems with date formats, consider using the lubridate package: its character-to-Date conversion functions recognize a large number of formats and don't give up at the first sign of trouble.
library(lubridate)
mdy(df1$varDate)
#[1] "2015-01-01" "2016-01-01" "2017-01-01" NA "2019-01-17"
#[6] NA NA
#Warning message:
# 3 failed to parse.
As you can see, only 3 failed to parse. The others were coerced to class Date. You could then use a much simpler ifelse, but the result would obviously be very different.
df3 <- data.frame(varID = df1$varID)
df3$VarIssue <- ifelse(is.na(mdy(df1$varDate)), "format issue", "ok")
df3
#  varID     VarIssue
#1    a1           ok
#2    a2           ok
#3    a3           ok
#4    a4 format issue
#5    a5           ok
#6    a6 format issue
#7    a7 format issue
Only 3 rows are flagged as "format issue".
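If that leniency is not what you want, a stricter check is possible in base R. The following is only a sketch, assuming the expected format is dd/mm/yyyy with "/" separators as in the question; is_strict_dmy is a hypothetical helper name, not part of lubridate or base R. It parses each string with an explicit format and then requires that formatting the parsed date reproduces the input exactly, so missing separators, extra digits, and impossible dates are all rejected.

```r
# Hypothetical helper (an assumption, not an existing function):
# TRUE only when x is a real calendar date written exactly as dd/mm/yyyy.
is_strict_dmy <- function(x, fmt = "%d/%m/%Y") {
  x <- as.character(x)
  parsed <- as.Date(x, format = fmt)      # NA when the string cannot be parsed
  # Round trip: the parsed date must format back to the identical string
  !is.na(parsed) & format(parsed, fmt) == x
}

# Example values from the question
dates <- c("01/01/2015", "0101/2016", "01/012017", "35/01/2018",
           "01/17/2019", "01/01/20200", "abc")

strict_issue <- ifelse(is_strict_dmy(dates), "ok", "format issue")
strict_issue
```

With the example values, only the first element should come out as "ok", which matches the outcome the question asks for.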
Upvotes: 2
Reputation: 7292
Two issues: you can't use a regex on its own, it has to be passed to a function that accepts a regex pattern; and you need to double the backslash in escape sequences such as \d.
In R string literals the backslash itself must be escaped, so \d is written as \\d, and your pattern becomes:
pattern <- '^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\\d\\d$'
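A small illustration of why the backslash is doubled, using only base R. The variable name digit_class is made up for this sketch. The string literal "\\d" holds two characters, a backslash and a d, which is exactly what the regex engine needs to see.

```r
# "\\d" in source code is the two-character string \d
digit_class <- "\\d"
n_chars <- nchar(digit_class)        # 2: one backslash plus the letter d
writeLines(digit_class)              # shows the string as the regex engine sees it

# The escaped class behaves as a digit matcher inside grepl()
two_digits <- grepl("^\\d\\d$", c("20", "2a"))
two_digits
# In R 4.0.0 or later, a raw string such as r"(\d)" is another way to write it.
```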
Then we use grepl (which returns a logical vector) to check each row:
df1$check <- ifelse(grepl(pattern, df1$varDate), "ok", "format issue")
  varID     varDate        check
1    a1  01/01/2015           ok
2    a2   0101/2016 format issue
3    a3   01/012017 format issue
4    a4  35/01/2018 format issue
5    a5  01/17/2019 format issue
6    a6 01/01/20200 format issue
7    a7         abc format issue
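To get from the check column to something shaped like the df2 in the question, the rows that fail the pattern can be subset out. This is a minimal sketch that rebuilds the example data so it runs on its own; the names shape_ok and flagged are chosen here for illustration. Note that the pattern only tests the shape of the string, so an impossible date such as 31/02/2015 would still pass it.

```r
# Rebuild the example data from the question
df1 <- data.frame(
  varID   = c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
  varDate = c("01/01/2015", "0101/2016", "01/012017", "35/01/2018",
              "01/17/2019", "01/01/20200", "abc"),
  stringsAsFactors = FALSE
)

pattern <- "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\\d\\d$"

# TRUE where the string has the expected dd/mm/yyyy shape
shape_ok <- grepl(pattern, df1$varDate)

# Keep only the IDs that need attention, as in the desired df2
flagged <- df1[!shape_ok, "varID", drop = FALSE]
flagged$VarIssue <- "format issue"
rownames(flagged) <- NULL
flagged
```

For tens of thousands of rows this stays fast, since grepl is vectorised over the whole column.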
Upvotes: 2