Namra Rehman

Reputation: 25

Remove digits from a string using regex only when they are attached to letters

I am trying to remove digits from text only when they are concatenated with letters or appear between the characters of a word, but not when they are part of dates.

For example, "21st" should remain "21st" and "alphab24et" should become "alphabet", but if the digits stand on their own, as in "26 alphabets", the text should remain "26 alphabets".

I am using the regex below:

newString = re.sub(r'[0-9]+', '', newString)

It removes digits in any position they occur, so in the example above it removes the 26 as well.

Upvotes: 1

Views: 81

Answers (3)

Said Taghadouini

Reputation: 96

What you should do is add parentheses to define capture groups and require that the digits be surrounded by characters that are neither whitespace nor digits.

re.sub(r"([^\s\d])\d+([^\s\d])", r'\1\2', newString)

This matches only digits that sit between two characters that are neither whitespace nor digits (the [^\s\d] parts). The replacement r'\1\2' puts those two surrounding characters back.
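As a rough check of the pattern above, here is a minimal sketch. The helper name remove_inner_digits and the sample strings are only illustrative and are not part of the question.

```python
import re

# Hypothetical helper name, used only for this illustration.
def remove_inner_digits(text):
    # \1 and \2 restore the non-space, non-digit characters captured
    # on each side of the digit run, so only the digits are dropped.
    return re.sub(r"([^\s\d])\d+([^\s\d])", r"\1\2", text)

print(remove_inner_digits("alphab24et"))    # alphabet
print(remove_inner_digits("21st"))          # 21st
print(remove_inner_digits("26 alphabets"))  # 26 alphabets
```

Note that any non-space, non-digit character counts as a neighbour, punctuation included, so if the quotes from the question are kept in the input, '"21st"' becomes '"st"'.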

Upvotes: 0

Thrastylon

Reputation: 980

I find that a good way to keep my re.sub calls clean is to capture the text around the pattern in groups ((...) below) and put it back in the substitution pattern (\1 and \2 below).

In your case you want to catch digit sequences ([0-9]+) that are not surrounded by whitespace (\s, since you want to keep those) or by other digits ([0-9], otherwise a digit could serve as the surrounding character and the middle of a standalone number would be removed): [^\s0-9]. This gives:

In [1]: re.sub(r"([^\s0-9])[0-9]+([^\s0-9])", r"\1\2", "11 a000b 11 11st x11 11")
Out[1]: '11 ab 11 11st x11 11'

Upvotes: 1

Wiktor Stribiżew

Reputation: 626853

You can match digit runs that have no word boundary on either side, using lookarounds as custom digit boundaries:

import re
newString = 'Like if "21st" then should remain "21st" But if  "alphab24et" should be  "alphabet" but if the digits come separately like  "26 alphabets" then it should remain  "26 alphabets" .'
print( re.sub(r'\B(?<!\d)[0-9]+\B(?!\d)', '', newString) )
# => Like if "21st" then should remain "21st" But if  "alphabet" should be  "alphabet" but if the digits come separately like  "26 alphabets" then it should remain  "26 alphabets" .

See the Python demo and the regex demo.

Details:

  • \B(?<!\d) - a non-word boundary position with no digit immediately on the left
  • [0-9]+ - one or more digits
  • \B(?!\d) - a non-word boundary position with no digit immediately on the right.
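To see how these three parts behave on a few more inputs, the following is a small sketch under assumed names (EMBEDDED_DIGITS and strip_embedded_digits are hypothetical, as are the extra sample strings "a1b2c" and "x11").

```python
import re

# Same pattern as in the answer, compiled once; the names are hypothetical.
EMBEDDED_DIGITS = re.compile(r"\B(?<!\d)[0-9]+\B(?!\d)")

def strip_embedded_digits(text):
    # Remove digit runs that have a non-digit word character on both sides.
    return EMBEDDED_DIGITS.sub("", text)

for sample in ['"21st"', "alphab24et", "26 alphabets", "a1b2c", "x11"]:
    print(repr(sample), "->", repr(strip_embedded_digits(sample)))
```

Because \B and the lookarounds are zero-width, the neighbouring letters are not consumed, so adjacent runs such as those in "a1b2c" are both removed, while digits at the start or end of a word ("21st", "x11") are left alone.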

Upvotes: 1
