Carrie Smith

Reputation: 41

R regex to selectively replace characters only at specific string positions

I'm error-checking Canadian postal codes in the format A1A1A1. A common typo is a capital O instead of a zero in position 2, 4 or 6; each such O should be replaced by a zero.

I'm fairly new to regex, and this one has me stumped. Thanks so much!

Upvotes: 3

Views: 703

Answers (4)

G. Grothendieck

Reputation: 270195

1) Using gsubfn we can do this with a particularly simple regular expression. Note that gsubfn allows the function in the second argument to be specified using a formula notation. Here the formula is regarded as a function of x and y (the two captured characters of each pair) with the indicated body:

library(gsubfn)
gsubfn("(.)(.)", ~ paste0(x, chartr("O", "0", y)), "O0OO1A")
## [1] "O0O01A"

Note that this works purely by position: it does not depend on the character before each numeric position being a letter, so it still works if that letter was itself miscoded as a digit, e.g. an oh typed as a zero.
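For comparison, the same position-based idea can be sketched in base R without gsubfn. This is only an illustration: fix_even_positions is a made-up helper name, and the second input is an invented test string.

```r
# Hypothetical helper (not part of gsubfn): replace "O" with "0"
# in even positions only, leaving odd positions untouched.
fix_even_positions <- function(codes) {
  vapply(strsplit(codes, "", fixed = TRUE), function(ch) {
    even <- seq_along(ch) %% 2 == 0          # TRUE at positions 2, 4, 6, ...
    ch[even] <- chartr("O", "0", ch[even])   # translate only those characters
    paste(ch, collapse = "")
  }, character(1))
}

fix_even_positions(c("O0OO1A", "AOBOCO"))
## [1] "O0O01A" "A0B0C0"
```

Because the replacement is driven by the character index rather than by what precedes it, an O in position 1, 3 or 5 is never touched.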

2) The above readily generalizes to convert ohs to zeros in even positions and zeros to ohs in odd positions. The regular expression stays the same and only the function specified in the second argument changes:

ohzero <- function(x, y) paste0(chartr("0", "O", x), chartr("O", "0", y))
gsubfn("(.)(.)", ohzero, "O00O1A")
## [1] "O0O01A"

3) To do that and also convert ones to eyes (I) in odd positions and eyes to ones in even positions, use this function instead of ohzero:

function(x, y) paste0(chartr("01", "OI", x), chartr("OI", "01", y))
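To see what such a pair function does without installing gsubfn, here is a rough base R sketch that feeds each two-character pair to it via regmatches. The names swap_pairs and ohzero_eyeone are made up for this illustration, and the input strings are invented.

```r
# Hypothetical driver: split each code into consecutive two-character
# pairs and rebuild it from fix_pair(first char, second char).
swap_pairs <- function(codes, fix_pair) {
  m <- gregexpr("..", codes)
  regmatches(codes, m) <- lapply(regmatches(codes, m), function(p) {
    fix_pair(substr(p, 1, 1), substr(p, 2, 2))
  })
  codes
}

# Same body as the function above, given a name for this sketch:
# digits -> letters in odd positions, letters -> digits in even positions.
ohzero_eyeone <- function(x, y) {
  paste0(chartr("01", "OI", x), chartr("OI", "01", y))
}

swap_pairs(c("0IOO1A", "A1A1A1"), ohzero_eyeone)
## [1] "O1O0IA" "A1A1A1"
```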

Upvotes: 2

Avinash Raj

Reputation: 174836

Use the regex below in gsub (with perl=TRUE, since lookbehind needs the PCRE engine) and replace each matched character with 0:

(?<=^.)O|(?<=^.{3})O|(?<=^.{5})O


OR

You could use the PCRE verbs (*SKIP)(*F). This replaces the letter O with a zero only in positions 2, 4 and 6, regardless of which letters or digits appear in the other positions.

> x <- c('AOAOAO', 'O2O3O2', 'BOB1B2', 'C1COC3')
> gsub("(?:(?<=^).|(?<=^..).|(?<=^....).)(*SKIP)(*F)|O", "0", x, perl=TRUE)
[1] "A0A0A0" "O2O3O2" "B0B1B2" "C1C0C3"
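If lookbehind and PCRE verbs feel heavy, the same position-only replacement can be sketched with R's default regex engine: one start-anchored sub call per position, capturing the characters before it. This is an alternative technique, not the approach above; the loop and the name fixed are just for illustration.

```r
x <- c("AOAOAO", "O2O3O2", "BOB1B2", "C1COC3")

fixed <- x
for (pos in c(2L, 4L, 6L)) {
  # "^(.{k})O": capture the first k = pos - 1 characters, then require
  # an "O" exactly at position pos; "\\10" is group 1 followed by "0".
  fixed <- sub(sprintf("^(.{%d})O", pos - 1L), "\\10", fixed)
}
fixed
## [1] "A0A0A0" "O2O3O2" "B0B1B2" "C1C0C3"
```

Each call can match at most once per string because of the ^ anchor, so sub is enough and gsub is not needed.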


Upvotes: 0

flodel

Reputation: 89097

You can do

x <- c("A0A0A0", "AOB0C0", "A0BOC0", "A0B0CO", "OOOOOO")

gsub("([A-Z])O", "\\10", x)
# [1] "A0A0A0" "A0B0C0" "A0B0C0" "A0B0C0" "O0O0O0"

A bit of explanation:

  • [A-Z] is any character from A to Z
  • the parentheses ([A-Z]) are here to capture the character so it can be referenced as \\1 in the replacement
  • ([A-Z])O is a character from A to Z followed by an O
  • \\1 is the captured character from A to Z
  • \\10 is the captured character followed by a 0
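Since this pattern relies on the preceding character being a letter, it can help to validate the result afterwards. A small sketch, assuming the strict A1A1A1 format from the question; the third input is an invented case where the leading letter was itself mistyped as a zero.

```r
x <- c("AOB0C0", "A0BOC0", "0OA1A1")

fixed <- gsub("([A-Z])O", "\\10", x)

# TRUE only for letter-digit-letter-digit-letter-digit
valid <- grepl("^([A-Z][0-9]){3}$", fixed)

fixed
## [1] "A0B0C0" "A0B0C0" "0OA1A1"
valid
## [1]  TRUE  TRUE FALSE
```

The last code is left unchanged because its O is preceded by a digit, and the validation step flags it for manual review.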

Upvotes: 5

hwnd

Reputation: 70742

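If the format is always like that, you could use gsub to replace the mistaken "O" characters. Here \K drops the already-matched letter from the match, so only the O that follows it is replaced.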

x <- c('A1A1A1', 'AOAOAO', 'A0B0CO', 'AOBOC0')
gsub('[A-Z]\\KO', '0', x, perl=TRUE)
# [1] "A1A1A1" "A0A0A0" "A0B0C0" "A0B0C0"
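One way to see what \K contributes is to compare it with a plain lookbehind. This is only a sketch; "KOO1A1" is an invented input where position 2 is a typo and position 3 is a legitimate letter O.

```r
x <- c("AOAOAO", "KOO1A1")

# \K: the letter is consumed, so the scan resumes after each letter+O pair
with_k <- gsub("[A-Z]\\KO", "0", x, perl = TRUE)

# Lookbehind: the letter is only inspected, so an O that follows another O
# also qualifies
with_lookbehind <- gsub("(?<=[A-Z])O", "0", x, perl = TRUE)

with_k
## [1] "A0A0A0" "K0O1A1"
with_lookbehind
## [1] "A0A0A0" "K001A1"
```

For alternating letter/digit codes, the consuming behaviour of \K is the safer of the two, since it leaves a genuine letter O in an odd position alone.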

Upvotes: 2
