gsub specific pattern and position in character string

Question

This is probably a fairly easy fix, but I'm not as good with regular expressions as I'd like to be, so help is appreciated. I have looked elsewhere and nothing is working for me.

I am trying to standardize some names of university degrees. I need the following format:

Degree Code - Major Name, e.g. "BA - Computer Stuff"

I.e. a word, single space, dash, single space, word.

My attempt does not recognize multiple spaces on one or both sides of the dash, and if it sees no spaces, it replaces the letters on either side of the dash with a lowercase s, whereas I thought that \s or \\s meant white space and that a space would be substituted.

This one bit of format fixing is part of a larger mutate statement, i.e. a single line with brackets, like the examples I have seen elsewhere, will not work for me.

I have example data:

data <- data.frame( var = c("BA-English" , "BA - English" , "BA -  Chemistry" , "BS  -  Rubber Chickens") )

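    library(dplyr)  # provides %>% and mutate()
    # my attempt, which does not give the result I want
    data %>%
      mutate(var = gsub("\\w\\S-\\S\\w", "\\w\\s-\\s\\w", var)) -> var_fix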

Any help is very much appreciated. Thank you

Wiktor Stribiżew · Accepted Answer

You can use

gsub("\\s*-\\s*", " - ", var)
## Or, if the hyphen is in between word chars
gsub("\\b\\s*-\\s*\\b", " - ", var)

See the regex demo #1 and regex demo #2.

Details:

\b - a word boundary
\s* - zero or more whitespaces
- - a hyphen
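
As a quick check, here is a minimal sketch that applies the first pattern to the sample values from the question. It uses a plain character vector rather than the full data frame, and the name var_fix is borrowed from the question. The last call is an illustrative guess at why the original attempt produced a lowercase s: in a gsub() replacement string, "\\s" is not a whitespace class, so only the letter s is written out.

```r
# Sample values from the question, with inconsistent spacing around the hyphen
var <- c("BA-English", "BA - English", "BA -  Chemistry", "BS  -  Rubber Chickens")

# Replace any run of whitespace (including none) around a hyphen with " - "
var_fix <- gsub("\\s*-\\s*", " - ", var)
print(var_fix)

# In the replacement, "\\s" is not a regex class: the backslash is dropped
# and a literal "s" is emitted on each side of the hyphen
bad <- gsub("-", "\\s-\\s", "BA-English")
print(bad)
```

Inside a pipeline the same call should drop in unchanged, e.g. mutate(var = gsub("\\s*-\\s*", " - ", var)).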

Note: if you also want to normalize other dash-like characters, consider gsub("(*UCP)\\s*[\\p{Pd}\\x{00AD}\\x{2212}]\\s*", " - ", var, perl=TRUE) or, with word boundaries, gsub("(*UCP)\\b\\s*[\\p{Pd}\\x{00AD}\\x{2212}]\\s*\\b", " - ", var, perl=TRUE). Here (*UCP) makes the word boundary and whitespace patterns Unicode-aware, \p{Pd} matches any Unicode dash punctuation character, \x{00AD} matches a soft hyphen, and \x{2212} matches a minus sign. In R string literals each backslash is doubled, as in the calls above.
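
The Unicode-aware variant can be tried on a few made-up inputs. This is only a sketch: the vector x and its values are hypothetical and not taken from the question, and the strings are written with \u escapes so that R treats them as UTF-8.

```r
# Hypothetical inputs: an en dash (U+2013), a minus sign (U+2212),
# and an ordinary ASCII hyphen
x <- c("BA\u2013English", "BS \u2212 Physics", "MA - History")

# perl = TRUE is needed for \p{..}, \x{..} and the (*UCP) verb
x_fix <- gsub("(*UCP)\\s*[\\p{Pd}\\x{00AD}\\x{2212}]\\s*", " - ", x, perl = TRUE)
print(x_fix)
```

All three dash variants should come out as a plain hyphen with a single space on each side.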
