Vaibhav

Reputation: 338

Remove the numeric portion of alphanumeric strings but keep the pure numbers

I'm trying to clean some strings that contain a combination of letters and numbers:

a <- c("Hello World","Hello4 World","12345","Hello World 4","4Hello World5","Hello 4", "Hello4")

I want to remove the digits from the alphanumeric strings but keep numbers that stand alone or are separated from the letters by a space. The output I'm looking for is:

b <- c("Hello World","Hello World","12345","Hello World 4","Hello World", "Hello 4","Hello")

The strings could be anything, not necessarily 'Hello' or 'World'. I've tried various regex combinations but couldn't get what I wanted.

Any help would be appreciated!

Upvotes: 0

Views: 117

Answers (2)

Onyambu

Reputation: 79348

gsub('(?i)(?<=[a-z])\\d+|\\d+(?=[a-z])', '', a, perl = TRUE)
[1] "Hello World"   "Hello World"   "12345"         "Hello World 4" "Hello World"   "Hello 4"       "Hello"   

Explanation:

  • (?i) makes the match case-insensitive. You can also drop it and pass the argument ignore.case = TRUE instead.

  • (?<=[a-z])\\d+ is a lookbehind: it matches one or more digits (\\d+) immediately preceded by a letter, (?<=[a-z]).

  • | means "or".

  • \\d+(?=[a-z]) is a lookahead: it matches one or more digits (\\d+) immediately followed by a letter, (?=[a-z]).

Each match is replaced with an empty string: replacement = '' is the second argument of gsub. Lookarounds need perl = TRUE.
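To see what each alternative contributes, here is a small sketch that runs the lookbehind and the lookahead separately before combining them. The variable names are only illustrative, and ignore.case = TRUE is used in place of the inline (?i) flag.

```r
a <- c("Hello World", "Hello4 World", "12345", "Hello World 4",
       "4Hello World5", "Hello 4", "Hello4")

# Lookbehind only: strips digits that directly follow a letter.
after_letter <- gsub("(?<=[a-z])\\d+", "", a, perl = TRUE, ignore.case = TRUE)

# Lookahead only: strips digits that directly precede a letter.
before_letter <- gsub("\\d+(?=[a-z])", "", a, perl = TRUE, ignore.case = TRUE)

# Both alternatives joined with |, as in the one-liner above.
both <- gsub("(?<=[a-z])\\d+|\\d+(?=[a-z])", "", a,
             perl = TRUE, ignore.case = TRUE)

print(after_letter)
print(before_letter)
print(both)
```

With the lookbehind alone, "4Hello World5" keeps its leading 4; with the lookahead alone, it keeps its trailing 5. Only the combined pattern removes both.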

gsub('([a-z])\\d+|\\d+([a-z])', '\\1\\2', a, ignore.case = TRUE)
[1] "Hello World"   "Hello World"   "12345"         "Hello World 4" "Hello World"   "Hello 4"       "Hello" 

This uses almost the same trick, but with capture groups and backreferences instead of lookarounds.

  • ([a-z])\\d+ captures the letter immediately before the digit(s) as group 1.
  • \\d+([a-z]) captures the letter immediately after the digit(s) as group 2.

The whole match is then replaced with the captured letter, i.e. \\1\\2. Only one of the two groups takes part in any given match; the other is empty.

You can also mix the two approaches, e.g. a lookbehind for the first alternative and a capture group for the second.
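One caveat worth checking before picking a variant: the capture-group pattern consumes the neighbouring letter, while the lookaround pattern does not. The sketch below wraps both in helper functions (lookaround and backref are hypothetical names, not part of the answer) and compares them on the sample data and on a single letter with digits on both sides.

```r
# Hypothetical wrappers around the two patterns from this answer.
lookaround <- function(x) {
  gsub("(?i)(?<=[a-z])\\d+|\\d+(?=[a-z])", "", x, perl = TRUE)
}
backref <- function(x) {
  gsub("([a-z])\\d+|\\d+([a-z])", "\\1\\2", x, ignore.case = TRUE)
}

a <- c("Hello World", "Hello4 World", "12345", "Hello World 4",
       "4Hello World5", "Hello 4", "Hello4")

# Both variants agree on the sample data.
same_on_sample <- identical(lookaround(a), backref(a))
print(same_on_sample)

# A lone letter between digits: the capture-group match "4H" uses up the H,
# so the trailing 5 no longer has a letter in front of it to match against.
print(lookaround("4H5"))  # "H"
print(backref("4H5"))     # "H5"
```

If inputs like this can occur, the lookaround version is the safer choice.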

Upvotes: 2

nbirla

Reputation: 610

Split the input on spaces, then apply a regex to each token:

[A-Za-z] - matches any letter

^[0-9]+$ - matches a token made up only of digits
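A minimal sketch of this split-based idea, assuming tokens are separated by single spaces. The function name strip_mixed_digits is hypothetical, and the per-token rule (strip digits only from tokens that contain both letters and digits) is one reading of the suggestion, not the answerer's own code.

```r
# Hypothetical helper: split on spaces, clean mixed tokens, re-join.
strip_mixed_digits <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    # A token is "mixed" if it contains at least one letter and one digit.
    mixed <- grepl("[A-Za-z]", tokens) & grepl("[0-9]", tokens)
    # Remove digits from mixed tokens only; pure numbers are left alone.
    tokens[mixed] <- gsub("[0-9]+", "", tokens[mixed])
    paste(tokens, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

a <- c("Hello World", "Hello4 World", "12345", "Hello World 4",
       "4Hello World5", "Hello 4", "Hello4")
b <- c("Hello World", "Hello World", "12345", "Hello World 4",
       "Hello World", "Hello 4", "Hello")

result <- strip_mixed_digits(a)
print(result)
print(identical(result, b))
```

This is more verbose than a single gsub call, but the rule for each token is spelled out explicitly, which can make it easier to adjust.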

Upvotes: 0
