Reputation: 55
I have a string that consists of a set of numbers and a URL. I need all the numeric characters except the ones that are part of the URL. Below is my code to remove all non-numeric characters, but it keeps the digits from the URL as well.
import re
test = '4758 11b98https://www.website11/111'
re.sub("[^0-9]","",test)  # gives '4758119811111'
expected result: 47581198
Upvotes: 0
Views: 113
Reputation: 163632
You could match any URL that starts with http:// or https:// so that the digits attached to it are consumed without being captured, and use an alternation (|) to capture the remaining digits in group 1.
Then, in the output, join all the group 1 values with an empty string.
https?://\S+|(\d+)
For example
import re
pattern = r"https?://\S+|(\d+)"
s = "4758 11b98https://www.website11/111"
print(''.join(re.findall(pattern, s)))
Output
47581198
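To see why the join works, it may help to look at what re.findall returns here: with a single capturing group it returns that group's text for every match, so a URL match contributes an empty string. A minimal sketch, where the helper name digits_outside_urls is hypothetical and not part of the question:

```python
import re

# Hypothetical helper name, used only for illustration.
def digits_outside_urls(text):
    # A URL match leaves group 1 empty; any other digit run is captured.
    groups = re.findall(r"https?://\S+|(\d+)", text)
    return "".join(groups)

s = "4758 11b98https://www.website11/111"
print(re.findall(r"https?://\S+|(\d+)", s))  # ['4758', '11', '98', '']
print(digits_outside_urls(s))                # 47581198
```

Because \S+ stops at whitespace, this assumes the URL contains no spaces; digits after a space following the URL would be captured again.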
Upvotes: 0
Reputation: 262284
A different strategy is much easier: just keep the leading digits and ignore the rest.
import re
test = '47581198https://www.website11/111'
re.findall(r'^\d+', test)[0]
Or, using re.match, if you are not sure that the leading digits are present:
m = re.match(r'\d+', test)
if m:
    m = m.group()
Output: '47581198'
If the leading number can contain other characters, and you're sure that the string 'http://' (or 'https://') cannot appear inside it, then you need two passes: one to remove the URL, and another to clean the number.
test = '4758 11b98https://www.website11/1111'
re.sub(r'\D', '', re.sub(r'https?://.*', '', test))
Output: '47581198'
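The two passes can also be wrapped in a small function. This is only a sketch: the name leading_digits is hypothetical, and it assumes the URL starts with http:// or https:// and that everything after it on the line can be discarded.

```python
import re

# Hypothetical helper name; assumes the URL runs to the end of the string.
def leading_digits(text):
    # Pass 1: drop the URL and everything that follows it.
    without_url = re.sub(r"https?://.*", "", text)
    # Pass 2: drop every remaining non-digit character.
    return re.sub(r"\D", "", without_url)

print(leading_digits("4758 11b98https://www.website11/1111"))  # 47581198
print(leading_digits("12 34"))                                 # 1234
```

Raw strings (r"...") are used so that \D reaches the regex engine without Python treating it as a string escape.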
Upvotes: 2
Reputation: 717
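You could use a lookahead so that only digit runs followed, somewhere later in the string, by http are kept:
import re
test = '4758 11b98https://www.website11/111'
y = re.compile(r'([0-9]+)(?=.*http)')
tokens = y.findall(test)
print(''.join(tokens))  # 47581198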
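One caveat with the lookahead: it keeps any digits that have http somewhere after them, so it relies on there being a single URL. The sketch below compares it with the alternation pattern https?://\S+|(\d+) on an assumed input containing two URLs, which goes beyond the question's example:

```python
import re

# Assumed input with two URLs; the question's string has only one.
s = "12https://a1.com 34https://b2.com"

lookahead = re.compile(r"([0-9]+)(?=.*http)")
alternation = re.compile(r"https?://\S+|(\d+)")

via_lookahead = "".join(lookahead.findall(s))
via_alternation = "".join(alternation.findall(s))

print(via_lookahead)    # 12134 (the 1 in a1.com is kept)
print(via_alternation)  # 1234
```

For a single trailing URL, as in the question, both patterns give the same result.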
Upvotes: 0