John

Reputation: 43199

Replace a character preceding a number using regex in R

I have a number of column names that can be represented by the following pattern.

dat <- c("Male97","Male98","Male99", "Male100andover","Female0","Female1" ,"Female2", "Female3", "Female4" ,"Female5", "Female100andover")

I am trying to insert a delimiting character, e.g. a dash, between the letters and the numeric characters using a regex.

My desired output is, for example, Male-97 or Female-0. However, I do not want the delimiting character inserted in the '100 and over' cases, such as Male100andover.

I have tried the following regex:

gsub('([e])[0-9]', '-', dat)

It nearly works, but I need something that does not replace the 'e' (and the digit after it) with the dash.

Can someone help me along with this, please?

Upvotes: 2

Views: 367

Answers (1)

Wiktor Stribiżew

Reputation: 626738

Your ([e])[0-9] regex matches and captures e followed by a digit, even if the digit is not at the end of the string. Then, you only use - in the replacement, so both the e and the digit are lost. You could try adding a second capturing group with ([0-9]) and restoring both groups in the replacement, but that would also change the value in Male100andover and the like.
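
To make the failure mode concrete, here is a minimal sketch that runs the original attempt and a two-group variant on a shortened copy of the question's vector. The names x, attempt, and two_groups are illustrative only, and the two-group pattern is just one way the variant described above might be written.

```r
# Shortened copy of the question's vector (x is an illustrative name)
x <- c("Male97", "Male100andover", "Female0")

# Original attempt: the "e" and the first digit are consumed and replaced by "-"
attempt <- gsub("([e])[0-9]", "-", x)
print(attempt)
# [1] "Mal-7"         "Mal-00andover" "Femal-"

# Two capturing groups keep both characters,
# but the dash also lands in "Male100andover"
two_groups <- gsub("(e)([0-9])", "\\1-\\2", x)
print(two_groups)
# [1] "Male-97"         "Male-100andover" "Female-0"
```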

You can use a regex with a capturing group anchored at the end of the string, like this:

dat <- c("Male97","Male98","Male99", "Male100andover","Female0","Female1" ,"Female2", "Female3", "Female4" ,"Female5", "Female100andover")
gsub("(\\d+)$", "-\\1", dat)

See IDEONE demo.

Explanation:

  • (\\d+) - matches and captures into Group 1 one or more digits that are...
  • $ - at the end of the string.

In the replacement pattern, \1 (written as "\\1" in an R string literal) is a backreference to the captured digits, so they are re-inserted after the dash.

Result:

 [1] "Male-97"          "Male-98"          "Male-99"          "Male100andover"  
 [5] "Female-0"         "Female-1"         "Female-2"         "Female-3"        
 [9] "Female-4"         "Female-5"         "Female100andover"
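
Since the question is about column names, the same pattern could be applied to names() directly. The following is only a sketch: the data frame df and its values are hypothetical stand-ins, and sub() is used instead of gsub() on the assumption that one replacement per name is enough, which holds here because the pattern is anchored with $.

```r
# Hypothetical data frame standing in for the real data
# (column names borrowed from the question)
df <- data.frame(Male97 = 1:2, Male100andover = 3:4, Female0 = 5:6)

# The pattern is anchored with $, so it can match at most once per name
names(df) <- sub("(\\d+)$", "-\\1", names(df))

print(names(df))
# [1] "Male-97"        "Male100andover" "Female-0"
```

Note that names containing a dash are not syntactic in R, so such columns have to be accessed with backticks or quotes, e.g. df$`Male-97` or df[["Male-97"]].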

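EDGE CASE HANDLING (digits followed by other characters, or several digit sequences):

dat <- c("Male97", "Male98over", "Male99over100under") ## input inferred from the results shown below
gsub("(\\d+\\D*)$", "-\\1", dat) ## insert before the last digit sequence
## [1] "Male-97"             "Male-98over"         "Male99over-100under"
gsub("^(\\D*)(\\d+)", "\\1-\\2", dat) ## insert before the first digit sequence
## [1] "Male-97"             "Male-98over"         "Male-99over100under"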

See another demo
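
If it helps to compare the three patterns from this answer side by side, they could be wrapped in one small function. This is a sketch only: add_dash() and its where argument are hypothetical names, not part of base R, and the sample vector x is an assumed input chosen to exercise each case.

```r
# add_dash() is a hypothetical helper, not part of base R
add_dash <- function(x, where = c("end", "last", "first")) {
  where <- match.arg(where)
  switch(where,
    end   = sub("(\\d+)$", "-\\1", x),          # digits must end the string
    last  = sub("(\\d+\\D*)$", "-\\1", x),      # before the last digit sequence
    first = sub("^(\\D*)(\\d+)", "\\1-\\2", x)  # before the first digit sequence
  )
}

# Assumed sample input covering all three situations
x <- c("Male97", "Male98over", "Male99over100under")

add_dash(x, "end")    # only "Male97" changes
add_dash(x, "last")   # dash before 97, 98 and 100
add_dash(x, "first")  # dash before 97, 98 and 99
```

Each pattern is anchored with ^ or $, so sub() and gsub() behave the same here; sub() is used to make the single replacement per string explicit.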

Upvotes: 4
