CuriousBeing

Reputation: 1632

Regex detect codes in R

I have a set of code patterns I want to check for in my dataframe, and I want to create a column that indicates TRUE or FALSE depending on whether each value matches.

Some of the codes I have in my dataframe: OO14562, MM156789076, AB1234674, HIB00000, POL112310

The dataframe is here:

df<-structure(list(Codes = structure(c(5L, 4L, 1L, 3L, 7L, 8L, 2L, 
6L), .Label = c("AB1234674", "AB13", "HIB00000", "MM156789076", 
"OO14562", "POL1123", "POL112310", "TY543"), class = "factor")), .Names = "Codes", row.names = c(NA, 
-8L), class = "data.frame")

Given this dataframe, the first five rows should return TRUE, and the last three should return FALSE.

My code is here

gsub([OO|MM|AB|HIB|POL[0-9]{5-9})

But that is not taking me anywhere.

Upvotes: 0

Views: 36

Answers (1)

Mako212

Reputation: 7312

One, we need to use parentheses, not brackets, to group the alternative prefixes. Brackets define a character class, meaning "match any one of these characters", and inside a class the pipe is just a literal character. So [aa|bb|cc] will actually match a single a, b, c, or a literal |, which is not the behavior you want.
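To make the bracket-versus-parenthesis difference concrete, here is a small sketch. The test strings are made up for illustration and are not from the question's data.

```r
# Illustrative inputs (not from the question's data)
x <- c("a", "|", "bb", "x")

# Character class: matches any single one of the characters a, b, c, or |
in_class <- grepl("[aa|bb|cc]", x)

# Group with alternation: matches the whole sequence aa, bb, or cc
in_group <- grepl("(aa|bb|cc)", x)

print(in_class)  # TRUE TRUE TRUE FALSE
print(in_group)  # FALSE FALSE TRUE FALSE
```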

Two, we'll use grepl because it returns a logical vector. gsub performs substitution, which isn't needed here.

Three, the number of repetitions to match is specified in curly braces { }, with min and max separated by a comma, not a dash.
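As a quick sketch of how the comma-separated bounds behave, the snippet below checks a few made-up digit strings. The ^ and $ anchors are included so that the upper bound is actually enforced on the whole string.

```r
# Illustrative digit strings of length 4, 5, 9, and 10 (not from the question)
digits <- c("1234", "12345", "123456789", "1234567890")

# {5,9} means "at least 5 and at most 9 repetitions" of the preceding item
ok <- grepl("^[0-9]{5,9}$", digits)

print(ok)  # FALSE TRUE TRUE FALSE
```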

You could also use [0-9] instead of \\d (any digit), but I like \\d for brevity.

And for completeness, I added ^ and $ to anchor the pattern to the beginning and end of the string, so the entire code has to match.

This gives us:

df$check <- grepl("^(OO|MM|AB|HIB|POL)\\d{5,9}$", df$Codes)


        Codes check
1     OO14562  TRUE
2 MM156789076  TRUE
3   AB1234674  TRUE
4    HIB00000  TRUE
5   POL112310  TRUE
6       TY543 FALSE
7        AB13 FALSE
8     POL1123 FALSE
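
The anchors make no difference for the eight codes in the question, but they can matter for other inputs. The sketch below compares the pattern with and without ^ and $ on a few hypothetical codes (the codes vector is invented for illustration).

```r
pattern_unanchored <- "(OO|MM|AB|HIB|POL)\\d{5,9}"
pattern_anchored   <- "^(OO|MM|AB|HIB|POL)\\d{5,9}$"

# Hypothetical codes: a valid one, one with 10 digits, one with extra leading letters
codes <- c("POL112310", "POL1234567890", "XXAB12345")

# Without anchors, a match anywhere inside the string counts
loose  <- grepl(pattern_unanchored, codes)

# With anchors, the whole string must fit the pattern
strict <- grepl(pattern_anchored, codes)

print(loose)   # TRUE TRUE TRUE
print(strict)  # TRUE FALSE FALSE
```

Whether the looser behavior is acceptable depends on what the codes are allowed to look like, which the question does not fully specify.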

Upvotes: 3
