missanita

Reputation: 256

Remove all punctuation AND the values after it at end of string in R

I have an ID variable that comes from 35 different hospitals, so it appears in many different formats. Sometimes the same root ID number carries a secondary line number, e.g. -1, /a, _1.

I want to remove the punctuation, and whatever comes after that punctuation, leaving just the root ID number.

So far I have written an individual line of code for each variation, but I was wondering whether there is a more elegant way, so that when next year's data comes in I don't need to check for new arrangements.

On someone else's question I found a way to remove brackets and all the text within them, but I can't figure out how to adapt it for my purposes:

df$patid<- gsub("\\s*\\([^\\)]+\\)","",df$patid)

I tried these two lines without success:

df$patid<- gsub("\\[:punct:]s*$","", df$patid)
df$patid<- gsub("\\[:alnum:]s*$","", df$patid)

I also tried the clean function, which removed all the punctuation, but kept the numbers/characters after them, so that wasn't it.

Example of my current code (not all possible variations). These do work:

df$patid<- gsub("\\-1$", "", df$patid)
df$patid<- gsub("\\-2$", "", df$patid)
df$patid<- gsub("\\-3$", "", df$patid)
df$patid<- gsub("\\-a$", "", df$patid)
df$patid<- gsub("\\-A$", "", df$patid)
df$patid<- gsub("\\-b$", "", df$patid)
df$patid<- gsub("\\-B$", "", df$patid)
df$patid<- gsub("\\b", "", df$patid)
df$patid<- gsub("\\/dd", "", df$patid)

I am not tied to gsub and am open to different methods.

Example of ID numbers

patid<- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")

Apologies if this has been answered somewhere already

Upvotes: 6

Views: 411

Answers (2)

Wiktor Stribiżew

Reputation: 627341

What you ask for is to remove a punctuation character and the one or two alphanumeric characters that follow it at the end of the string:

gsub("[[:punct:]][[:alnum:]]{1,2}$", "", x)

See the R demo. The TRE-compliant pattern [[:punct:]][[:alnum:]]{1,2}$ matches a punctuation character ([[:punct:]]), then one or two alphanumerics ([[:alnum:]]{1,2}), and then asserts that the end of the string ($) comes right after them. See the regex demo.
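As a rough illustration of the pattern described above, here is a minimal sketch run on a hand-picked subset of the IDs from the question. The variable names (ids, suffix_pattern, root_ids) are placeholders chosen for this example, not part of the original answer.

```r
# Illustrative subset of the IDs from the question
ids <- c("MB-13-169454", "MB-13-212235.1", "570548260-2",
         "8253015N/3", "04152341421/A", "G140381883_1")

# One punctuation character followed by one or two alphanumerics at the end
suffix_pattern <- "[[:punct:]][[:alnum:]]{1,2}$"

# The pattern is anchored with $, so it can match at most once per string
root_ids <- gsub(suffix_pattern, "", ids)
print(root_ids)

# Collapse to the distinct root IDs
print(unique(root_ids))
```

Because the suffix is limited to one or two characters, an ID such as "MB-13-169454", whose last segment has six digits, is left untouched, while ".1", "-2", "/3", "/A" and "_1" are stripped.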

To remove any punctuation AND the text after it at the end of the string, you can use:

gsub("[\\p{S}\\p{P}]+[^\\p{S}\\p{P}]*$", "", x, perl=TRUE)

NOTE: The same pattern also works with the stringr::str_replace_all function. With gsub, you must pass perl=TRUE, because the pattern is PCRE-compliant, not TRE-compliant.

See the regex demo.

Details:

  • [\p{S}\p{P}]+ - one or more symbol (\p{S}) or punctuation (\p{P}) characters. The default engine uses a POSIX-compliant version of [:punct:] that includes both of these Unicode categories, but the ICU regex engine used in the stringr regex functions is not POSIX compliant and behaves differently, which is why I am suggesting this pattern.
  • [^\p{S}\p{P}]* - zero or more characters other than symbol or punctuation characters
  • $ - end of string.
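To make the breakdown above more concrete, the following sketch extracts the part of each string that the pattern matches before removing it. It uses only base R (regexpr, regmatches, sub); the names ids, pattern, matched and remaining are illustrative assumptions.

```r
ids <- c("MB-13-169454", "MB-13-212235.1", "8253015N/2", "G140381883_1")

# PCRE pattern from the answer; it requires perl = TRUE
pattern <- "[\\p{S}\\p{P}]+[^\\p{S}\\p{P}]*$"

# regexpr() locates the match in each string; regmatches() extracts it
matched <- regmatches(ids, regexpr(pattern, ids, perl = TRUE))
print(matched)

# Removing the matched part leaves the text before the last punctuation run
remaining <- sub(pattern, "", ids, perl = TRUE)
print(remaining)
```

Note that the match always starts at the last run of punctuation, so for an ID with internal hyphens such as "MB-13-169454" the whole final segment "-169454" is removed.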

See the R demo online:

patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")

gsub("[\\p{S}\\p{P}]+[^\\p{S}\\p{P}]*$", "", patid, perl=TRUE)

Output:

 [1] "MB-13"        "MB-13"        "MB-13-212235" "MB-13-212235" "MB-13"       
 [6] "570548260"    "570548260"    "1458629P"     "1139093D"     "8253015N"    
[11] "8253015N"     "M255858"      "M255858"      "8494392Q"     "9296741B"    
[16] "04152341421"  "04152341421"  "04152640475"  "04152821164"  "G140381883"  
[21] "G140381883"   "G140880774"   "G140880774"  


Upvotes: 1

Tim Biegeleisen

Reputation: 522571

A literal regex for what you described would be:

[[:punct:]][^[:punct:]]*$

This matches the final punctuation character in the string, together with everything that follows it up to the end of the string.

patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")
output <- sub("[[:punct:]][^[:punct:]]*$", "", patid)
output

 [1] "MB-13"        "MB-13"        "MB-13-212235" "MB-13-212235" "MB-13"       
 [6] "570548260"    "570548260"    "1458629P"     "1139093D"     "8253015N"    
[11] "8253015N"     "M255858"      "M255858"      "8494392Q"     "9296741B"    
[16] "04152341421"  "04152341421"  "04152640475"  "04152821164"  "G140381883"  
[21] "G140381883"   "G140880774"   "G140880774"  
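A side note on the choice of sub here versus gsub elsewhere on this page: since the pattern is anchored with $, it can match at most once per string, so the two functions should give the same result. A small sketch to check this, on a few IDs picked from the question (variable names are illustrative):

```r
ids <- c("570548260-2", "8253015N/3", "04152341421/A", "G140381883_1")
pattern <- "[[:punct:]][^[:punct:]]*$"

# sub() replaces the first match, gsub() replaces every match
with_sub  <- sub(pattern, "", ids)
with_gsub <- gsub(pattern, "", ids)

print(with_sub)

# TRUE when both functions return the same vector
print(identical(with_sub, with_gsub))
```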

Upvotes: 8
