Francis Smart

Reputation: 4055

Regex Group Backref non-matching value

I am doing the common task of extracting dates from string entries with inconsistent formatting, which also contain numbers that look much like dates. One consistency that holds for most dates is that the delimiter between the numbers is the same throughout a given date.

library(stringr)
library(dplyr)

dat1 = c("01-25-2019", "15 01 2018", "01.16.2018", "01.24 2018", "01.22 19 PSI", "10.19 PSI", "01.01.01")

dat1 %>% str_extract("[0-9]{1,4}([- /\\.])[0-9]{1,4}(\\1[0-9]{1,4}|)")
# [1] "01-25-2019" "15 01 2018" "01.16.2018" "01.24"      "01.22"      "10.19"      "01.01.01"  

Back-referencing seems effective at enforcing consistent delimiters. What I would also like to do is back-reference a non-match, so that if a year (201[5-9]) is matched in one position it cannot be matched in another. Likewise for month and day. At times I also need to allow the year to be implied by context. That is what the last group, (...|), in the pattern above is doing.

The following is my attempt, using ^ as a match nullifier.

dat1 %>% str_extract("([0-3][0-9]|[0-3][0-9]|(201[5-9]|1[5-9]))([ /\\.])(^\\1)(\\3(^\\1)|)")

# [1] NA         NA         NA         NA         NA         NA         NA

Upvotes: 0

Views: 60

Answers (1)

user10191355

I'm not sure about using backreferences in this case, but using a lookahead might make sense if the formatting isn't always consistent. Using your data + "01.22.19 PSI" and "01.24 2018 19 PSI" as extra test cases:

dat1 = c("01-25-2019", "15 01 2018", "01.16.2018", "01.24 2018", "01.24 2018 19 PSI", "01.22 19 PSI", "10.19 PSI", "01.01.01", "01.22.19 PSI")

The last group is the important part. Its first alternative accepts a third number of 2-4 digits after a space, hyphen, or period, but only if that number is followed by the end of the string or by a space plus another digit. Otherwise the second alternative applies, and the final separator must be a hyphen or period:

str_extract(dat1, "\\d{2}[-\\. ]\\d{2}([-\\. ]\\d{2,4}(?= \\d|$)|[-\\.]\\d{2,4})?")

#### OUTPUT ####
[1] "01-25-2019" "15 01 2018" "01.16.2018" "01.24 2018" "01.24 2018" "01.22"      "10.19"      "01.01.01"   "01.22.19" 

The obvious benefit is that it can also work with inconsistent formatting such as "01.24 2018" and "01.24 2018 19 PSI". It might still need some fine tuning, but I think it should be fairly straightforward to build on this principle.
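To make the role of the lookahead concrete, here is a small comparison on three of the test strings. The pattern without the lookahead is my own simplification for illustration, not something from the question. The only difference shows up when a space-separated number is followed by something like PSI:

```r
library(stringr)

x <- c("01.24 2018", "01.22 19 PSI", "01.22.19 PSI")

# Simplified pattern (illustrative): any third number is accepted,
# so the "19" in "19 PSI" is pulled into the date.
without_la <- str_extract(x, "\\d{2}[-. ]\\d{2}([-. ]\\d{2,4})?")
# [1] "01.24 2018" "01.22 19"   "01.22.19"

# Pattern from this answer: a space-separated third number only counts
# if it is followed by the end of the string or by a space plus a digit.
with_la <- str_extract(x, "\\d{2}[-\\. ]\\d{2}([-\\. ]\\d{2,4}(?= \\d|$)|[-\\.]\\d{2,4})?")
# [1] "01.24 2018" "01.22"      "01.22.19"
```

Whether "01.22.19 PSI" should really yield "01.22.19" depends on your data, so treat this as a starting point rather than a finished rule.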

Another, simpler, approach that I frequently use is to eliminate obvious non-matches first. For example, it might be easier to first remove PSI preceded by some digits, and only then to look for the dates.
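A rough sketch of that two-step idea, assuming that a reading to be discarded is always a whitespace-preceded integer directly before PSI (that rule is my assumption, adjust it to your data). Once those are gone, a plainer date pattern without the lookahead is enough for these test cases:

```r
library(stringr)

dat1 <- c("01-25-2019", "15 01 2018", "01.16.2018", "01.24 2018",
          "01.24 2018 19 PSI", "01.22 19 PSI", "10.19 PSI", "01.01.01",
          "01.22.19 PSI")

# Step 1: remove " <digits> PSI" (assumed shape of a non-date reading).
# Numbers attached to a date by "." or "-" are left alone.
cleaned <- str_remove(dat1, "\\s\\d+\\s*PSI")

# Step 2: extract dates with a simpler pattern, no lookahead needed.
dates <- str_extract(cleaned, "\\d{2}[-. ]\\d{2}([-. ]\\d{2,4})?")
# [1] "01-25-2019" "15 01 2018" "01.16.2018" "01.24 2018" "01.24 2018"
# [6] "01.22"      "10.19"      "01.01.01"   "01.22.19"
```

On this data the result matches the lookahead version above, but each step is easier to read and to adjust on its own.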

Upvotes: 1
