Reputation: 7
I am working on a text analytics problem and I am trying to remove all the numbers between 0 and 999 with a regular expression in Python. I tried the Regex Numeric Range Generator to get the regular expression, but I didn't succeed. I can only remove all the numbers.
I have tried several regexes, but none of them worked. Here's what I tried:
# Remove numbers starting from 0 ==> 999
data_to_clean = re.sub('[^[0-9]{1,3}$]', ' ', data_to_clean)
I have tried this also:
# Remove numbers starting from 0 ==> 999
data_to_clean = re.sub('\b([0-9]|[1-8][0-9]|9[0-9]|[1-8][0-9]{2}|9[0-8][0-9]|99[0-9])\b', ' ', data_to_clean)
this one:
^([0-9]|[1-8][0-9]|9[0-9]|[1-8][0-9]{2}|9[0-8][0-9]|99[0-9])$
and this:
def clean_data(data_to_clean):
    # Remove numbers starting from 0 ==> 999
    data_to_clean = re.sub('[^[0-9]{1,3}$]', ' ', data_to_clean)
    return data_to_clean
I have a lot of numbers, but I need to delete only the ones with at most three digits and keep the others.
Thank you for your help.
Upvotes: 0
Views: 404
Reputation: 2695
You need to precede the pattern string with an r (making it a raw string) so the interpreter won't replace \b with a backspace character. You can also simplify the pattern like this:
data_to_clean = re.sub(r'\b([0-9]|[1-9][0-9]{1,2})\b', ' ', data_to_clean)
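To see why the r prefix matters, here is a minimal sketch. The sample string and the helper name remove_small_numbers are made up for illustration and are not part of the question's code.

```python
import re

# Without the r prefix, Python converts \b to a backspace character (0x08)
# before the regex engine ever sees the pattern.
assert '\b' == '\x08'
assert r'\b' == '\\b'

# Pattern from the answer above: a single digit, or a non-zero digit
# followed by one or two more digits, delimited by word boundaries.
PATTERN = r'\b([0-9]|[1-9][0-9]{1,2})\b'

def remove_small_numbers(text):
    """Replace standalone numbers 0..999 (no leading zeros) with a space."""
    return re.sub(PATTERN, ' ', text)

sample = "order 7 of 42 cost 999 not 1000 or 12345"  # hypothetical input
cleaned = remove_small_numbers(sample)

# The same idea without a raw string searches for literal backspaces,
# so nothing in the sample is replaced.
unchanged = re.sub('\b[0-9]{1,3}\b', ' ', sample)

print(cleaned.split())
print(unchanged == sample)
```

Each removed number leaves a space behind, so you may want to collapse repeated whitespace afterwards, for example with ' '.join(cleaned.split()).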
Upvotes: 1
Reputation: 4013
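Numbers from 0 to 999 are: a single digit (0 to 9), a two-digit number with a non-zero first digit (10 to 99), or a three-digit number with a non-zero first digit (100 to 999).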
This gives a naive regex of /\b(?:[0-9]|[1-9][0-9]|[1-9][0-9][0-9])\b/
However, we have duplicated character classes in the alternatives, so we can factor them out:
/(?!\b0[0-9])\b[0-9]{1,3}\b/
This works by using a negative lookahead, (?!\b0[0-9]), to check for the start of a word followed by a 0 and then another digit, which rejects 01 etc. It then looks for one to three characters in the range 0-9. Because the negative lookahead needs at least two characters, a single 0 still passes as valid.
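As a rough check of the lookahead version in Python, here is a small sketch. The slashes are regex delimiters and are dropped in a Python raw string. The sample string and the helper name matched_numbers are invented for illustration.

```python
import re

# Lookahead variant from above, written as a Python raw string.
NO_LEADING_ZERO = re.compile(r'(?!\b0[0-9])\b[0-9]{1,3}\b')

def matched_numbers(text):
    """Return every standalone 1-3 digit number that has no leading zero."""
    return NO_LEADING_ZERO.findall(text)

sample = "0 5 01 42 007 999 1000"  # hypothetical input
found = matched_numbers(sample)

# A lone 0 is accepted, 01 and 007 are rejected by the lookahead,
# and 1000 is rejected by the trailing word boundary.
print(found)  # ['0', '5', '42', '999']
```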
Upvotes: 0
Reputation: 1960
I think you can combine the word boundaries (\b) from your earlier attempt with the [0-9]{1,3} from your last attempt.
So the resulting regex should look like: \b[0-9]{1,3}\b
If you check the demo at regex101.com/r/qDrobh/6, you will see that it replaces all 1-digit, 2-digit and 3-digit numbers and ignores higher numbers and other words.
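Here is a short sketch of how this simpler pattern behaves in Python. The sample strings and the helper name strip_short_numbers are made up for illustration. Note that, unlike the stricter patterns, it also removes numbers with leading zeros such as 007, and it treats the two halves of a decimal like 3.14 as separate short numbers because the dot is a word boundary.

```python
import re

# One to three digits delimited by word boundaries.
SHORT_NUMBER = re.compile(r'\b[0-9]{1,3}\b')

def strip_short_numbers(text):
    """Replace every standalone run of 1-3 digits with a space."""
    return SHORT_NUMBER.sub(' ', text)

sample = "room 12 has 3 chairs and 4500 books, code 007"  # hypothetical input
cleaned = strip_short_numbers(sample)
print(cleaned.split())

# Digits glued to letters are left alone (no word boundary between them),
# while both parts of a decimal number are removed.
print(strip_short_numbers("abc123"))
print(strip_short_numbers("pi is 3.14").split())
```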
Upvotes: 0