Reputation: 41
I'm error checking Canadian postal codes in the format A1A1A1. Common typos are capital O instead of zeros in positions 2, 4 or 6, which should be replaced by a zero.
I'm fairly new to regex, and this one has me stumped. Thanks so much!
Upvotes: 3
Views: 703
Reputation: 270195
1) Using gsubfn we can do this with a particularly simple regular expression. Note that gsubfn allows the function in the second argument to be specified using a formula notation. Here it is regarded as a function of x and y with the indicated body:
library(gsubfn)
gsubfn("(.)(.)", ~ paste0(x, chartr("O", "0", y)), "O0OO1A")
## [1] "O0O01A"
Note that this works purely by position. It does not depend on the character before each numeric position being a letter, so it works even if that letter was itself mistyped as a digit, e.g. a zero in place of an oh.
2) The above readily generalizes to convert ohs to zeros in even positions and zeros to ohs in odd positions. The regular expression stays the same and only the function specified in the second argument changes:
ohzero <- function(x, y) paste0(chartr("0", "O", x), chartr("O", "0", y))
gsubfn("(.)(.)", ohzero, "O00O1A")
## [1] "O0O01A"
3) Or, to do that plus convert ones to eyes (I) and eyes to ones, use this function instead of ohzero:
function(x, y) paste0(chartr("01", "OI", x), chartr("OI", "01", y))
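For readers without gsubfn installed, the same position-based idea can be sketched in base R. This is only an illustration: fix_postal is a hypothetical helper name, not part of any package, and it assumes plain uppercase ASCII input with no missing values.

```r
# Hypothetical helper: position-based cleanup using base R only.
fix_postal <- function(codes) {
  vapply(strsplit(codes, "", fixed = TRUE), function(ch) {
    odd <- seq_along(ch) %% 2 == 1
    ch[odd]  <- chartr("01", "OI", ch[odd])   # letter slots: 0 -> O, 1 -> I
    ch[!odd] <- chartr("OI", "01", ch[!odd])  # digit slots:  O -> 0, I -> 1
    paste(ch, collapse = "")
  }, character(1))
}

fix_postal(c("K1AOB1", "KIA0BI"))
## [1] "K1A0B1" "K1A0B1"
```

Like the gsubfn versions, this looks only at positions, so it also handles strings that are not exactly six characters long.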
Upvotes: 2
Reputation: 174836
Use the regex below in gsub with perl=TRUE (lookbehind needs the PCRE engine) and replace each matched character with 0:
(?<=^.)O|(?<=^.{3})O|(?<=^.{5})O
OR
You could use the PCRE verbs (*SKIP)(*F). This replaces the letter O with the digit 0 only in positions 2, 4 and 6 of a six-character code. It doesn't care which letters or numbers are present in the other positions.
> x <- c('AOAOAO', 'O2O3O2', 'BOB1B2', 'C1COC3')
> gsub("(?:(?<=^).|(?<=^..).|(?<=^....).)(*SKIP)(*F)|O", "0", x, perl=TRUE)
[1] "A0A0A0" "O2O3O2" "B0B1B2" "C1C0C3"
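For comparison, the first, lookbehind-based pattern in this answer could be applied to the same vector as follows. This is a minimal sketch; the variable name fixed is only illustrative.

```r
x <- c("AOAOAO", "O2O3O2", "BOB1B2", "C1COC3")

# Each alternative matches an O preceded by exactly 1, 3 or 5 characters
# counted from the start of the string, i.e. an O in position 2, 4 or 6.
fixed <- gsub("(?<=^.)O|(?<=^.{3})O|(?<=^.{5})O", "0", x, perl = TRUE)
fixed
# [1] "A0A0A0" "O2O3O2" "B0B1B2" "C1C0C3"
```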
Upvotes: 0
Reputation: 89097
You can do
x <- c("A0A0A0", "AOB0C0", "A0BOC0", "A0B0CO", "OOOOOO")
gsub("([A-Z])O", "\\10", x)
# [1] "A0A0A0" "A0B0C0" "A0B0C0" "A0B0C0" "O0O0O0"
A bit of explanation:
[A-Z] is any character from A to Z
the parentheses in ([A-Z]) capture that character so it can be referenced as \\1 in the replacement
([A-Z])O is a character from A to Z followed by an O
\\1 is the captured character from A to Z
\\10 is the captured character followed by a 0
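Since the goal is error checking, the replacement can be followed by a format check. The sketch below is an assumption about how one might do that, not part of the answer above: the names fixed and ok are illustrative, and the last input is deliberately malformed to show a code that the substitution cannot repair.

```r
x <- c("A0A0A0", "AOB0C0", "A0BOC0", "A0B0CO", "10BOC0")

# Replace an O that directly follows a capital letter with a zero.
fixed <- gsub("([A-Z])O", "\\10", x)

# Check the A1A1A1 shape: three letter-digit pairs and nothing else.
ok <- grepl("^([A-Z][0-9]){3}$", fixed)

fixed
# [1] "A0A0A0" "A0B0C0" "A0B0C0" "A0B0C0" "10B0C0"
ok
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```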
Upvotes: 5
Reputation: 70742
If the format is always like that, you could use gsub to replace the mistaken "O" characters. With perl=TRUE, \\K drops the preceding letter from the match, so only the O is replaced.
x <- c('A1A1A1', 'AOAOAO', 'A0B0CO', 'AOBOC0')
gsub('[A-Z]\\KO', '0', x, perl=TRUE)
# [1] "A1A1A1" "A0A0A0" "A0B0C0" "A0B0C0"
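One detail worth knowing is that \\K is not the same as a lookbehind, because the letter before \\K is still consumed by the match. The sketch below contrasts the two on an extreme input made only of ohs; the variable names are illustrative.

```r
bad <- "OOOOOO"

# \K: each match consumes a letter plus the O after it, so the
# characters are processed in non-overlapping pairs.
with_k <- gsub("[A-Z]\\KO", "0", bad, perl = TRUE)

# Lookbehind: the preceding letter is only inspected, not consumed,
# so every O that follows another O is replaced.
with_lookbehind <- gsub("(?<=[A-Z])O", "0", bad, perl = TRUE)

with_k
# [1] "O0O0O0"
with_lookbehind
# [1] "O00000"
```

For inputs where every odd position already holds a letter other than O, both patterns give the same result.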
Upvotes: 2