Reputation: 25
I am trying to remove digits from text only when they are attached to letters inside a word, i.e. when they come between other characters of the word. Dates and ordinals should be left alone.
For example, "21st" should remain "21st", but "alphab24et" should become "alphabet". If the digits stand on their own, as in "26 alphabets", the text should remain "26 alphabets".
I am using the regex below:
newString = re.sub(r'[0-9]+', '', newString)
but it removes digits in any position they occur, so in the example above it removes the 26 as well.
Upvotes: 1
Views: 81
Reputation: 96
What you should do is add parentheses to define groups, and require that the digits be surrounded by characters that are neither whitespace nor digits.
re.sub(r"([^\s\d])\d+([^\s\d])", r'\1\2', newString)
This matches only digits that sit between two characters that are neither whitespace nor digits: that is the [^\s\d] part.
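One thing worth checking with the capture-group approach: the neighbouring characters are consumed by the match, so two digit runs separated by a single letter cannot both be removed in one pass. The sketch below illustrates this. The lookaround pattern and both helper names are my own assumptions for illustration, not part of the answer above.

```python
import re

# Capture-group pattern from the answer above: neighbours are matched and put back.
GROUPS = re.compile(r"([^\s\d])\d+([^\s\d])")
# Hypothetical lookaround variant: same condition, but neighbours are only inspected.
LOOKAROUND = re.compile(r"(?<=[^\s\d])\d+(?=[^\s\d])")

def strip_with_groups(text):
    """Remove inner digits, restoring the captured neighbours."""
    return GROUPS.sub(r"\1\2", text)

def strip_with_lookarounds(text):
    """Remove inner digits without consuming the neighbours."""
    return LOOKAROUND.sub("", text)

print(strip_with_groups("alphab24et 21st 26 alphabets"))  # alphabet 21st 26 alphabets
print(strip_with_groups("a1b2c"))       # ab2c  (the "b" was consumed by the first match)
print(strip_with_lookarounds("a1b2c"))  # abc
```

For the examples in the question both versions behave the same; the difference only shows up on inputs like "a1b2c".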
Upvotes: 0
Reputation: 980
I find that a way to make my re.sub calls cleaner is to capture the things around my pattern in groups ((...) below) and put them back in the substitution pattern (\1 and \2 below).
In your case you want to catch digit sequences ([0-9]+) that are not surrounded by whitespace (\s, since you want to keep those) or by other digits ([0-9], otherwise a digit could itself act as the neighbouring character and a number such as 11st would be partially stripped): [^\s0-9]. This gives:
In [1]: re.sub(r"([^\s0-9])[0-9]+([^\s0-9])", r"\1\2", "11 a000b 11 11st x11 11")
Out[1]: '11 ab 11 11st x11 11'
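A caveat that may matter for the text in the question: [^\s0-9] also accepts punctuation as a neighbour, so a quoted "21st" would lose its digits because the opening quote counts as the left-hand character. The sketch below compares the pattern above with a hypothetical letters-only variant (my assumption, ASCII letters only; widen the class if you need other alphabets).

```python
import re

# Pattern from the answer above: neighbours may be any non-space, non-digit character.
any_neighbour = re.compile(r"([^\s0-9])[0-9]+([^\s0-9])")
# Hypothetical variant: neighbours must be ASCII letters (checked with lookarounds).
letter_neighbour = re.compile(r"(?<=[A-Za-z])[0-9]+(?=[A-Za-z])")

sample = 'keep "21st" and fix alphab24et'

print(any_neighbour.sub(r"\1\2", sample))  # keep "st" and fix alphabet
print(letter_neighbour.sub("", sample))    # keep "21st" and fix alphabet
```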
Upvotes: 1
Reputation: 626853
You can match digit runs that have no word boundary on either side, adding custom digit boundaries with lookarounds:
import re
newString = 'Like if "21st" then should remain "21st" But if "alphab24et" should be "alphabet" but if the digits come separately like "26 alphabets" then it should remain "26 alphabets" .'
print( re.sub(r'\B(?<!\d)[0-9]+\B(?!\d)', '', newString) )
# => Like if "21st" then should remain "21st" But if "alphabet" should be "alphabet" but if the digits come separately like "26 alphabets" then it should remain "26 alphabets" .
See the Python demo and the regex demo.
Details:
\B(?<!\d) - a non-word boundary position with no digit immediately on the left
[0-9]+ - one or more digits
\B(?!\d) - a non-word boundary position with no digit immediately on the right.
Upvotes: 1