wen tian

Reputation: 441

Regex to extract specific number from the URL based on the URL pattern

I am trying to extract the number from the URL. Here is the code I tried:

import re

urlss = 'http://www.deyi.com/thread-24488-1-1.html'
urlss = re.sub('http://www.deyi.com/thread-(.*?)-1-1.html', '', urlss)
print(urlss)

My expected result is the number below:

24488

How can I achieve this?

Upvotes: 0

Views: 67

Answers (2)

user9158931

You can use a positive lookahead with a capturing group inside it, (?=(\d+)):

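import re

urlss = 'http://www.deyi.com/thread-24488-1-1.html'

# Raw string so that \d is not treated as a string escape sequence.
pattern = r'thread-(?=(\d+))'

match = re.search(pattern, urlss)
print(match.group(1))  # group 1 holds the digits captured inside the lookahead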

output:

24488
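
As a side note, the lookahead is not strictly required here. The sketch below (an illustration, not part of the original answer) compares it with a plain capturing group; the variable names `lookahead` and `plain` are made up for this example. In both cases group 1 holds the number, and only the overall match differs.

```python
import re

urlss = 'http://www.deyi.com/thread-24488-1-1.html'

# Lookahead version: the overall match is only 'thread-',
# and the digits are captured in group 1 without being consumed.
lookahead = re.search(r'thread-(?=(\d+))', urlss)

# Plain capturing group: the overall match is 'thread-24488',
# and group 1 is again the digits.
plain = re.search(r'thread-(\d+)', urlss)

print(lookahead.group(0), lookahead.group(1))  # thread- 24488
print(plain.group(0), plain.group(1))          # thread-24488 24488
```

Either form works when only group 1 is needed; the plain group is usually easier to read.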

If the URL pattern is always the same and only the thread number or page numbers change, you can use a simpler pattern that matches the first run of two or more digits:

import re

urlss = 'http://www.deyi.com/thread-24488-1-1.html'

# \d{2,} matches the same text as the more convoluted (\d+){2}.
pattern = r'\d{2,}'

match = re.search(pattern, urlss)
print(match.group())

output:

24488

Upvotes: 0

Moinuddin Quadri

Reputation: 48120

re.sub replaces the matched content in the string; because your pattern matches the entire URL, the result is an empty string. Use re.search to extract a substring instead. The regex below extracts the desired number from the URL:

'(?<=thread-)\d+'

This regex matches the first continuous run of digits immediately after "thread-".

For example:

>>> import re
>>> urlss = 'http://www.deyi.com/thread-24488-1-1.html'
>>> re.search(r'(?<=thread-)\d+', urlss).group()
'24488'
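
A slightly more defensive sketch, assuming the URLs always look like the one in the question: re.search returns None when nothing matches, so calling .group() directly would raise AttributeError on an unexpected URL. The helper name `extract_thread_id` and the second test URL are hypothetical, introduced only for illustration. The last lines also show why the original re.sub call printed nothing, and how a backreference would make it work.

```python
import re

# Hypothetical helper; the pattern assumes 'thread-' is followed by the number.
THREAD_ID = re.compile(r'(?<=thread-)\d+')

def extract_thread_id(url):
    """Return the digits right after 'thread-' as a string, or None if absent."""
    match = THREAD_ID.search(url)
    return match.group() if match else None

urlss = 'http://www.deyi.com/thread-24488-1-1.html'
print(extract_thread_id(urlss))                             # 24488
print(extract_thread_id('http://www.deyi.com/forum.html'))  # None

# The question's pattern matches the whole URL, so replacing it with ''
# leaves an empty string; replacing it with group 1 keeps only the number.
question_pattern = r'http://www.deyi.com/thread-(.*?)-1-1.html'
emptied = re.sub(question_pattern, '', urlss)
kept = re.sub(question_pattern, r'\1', urlss)
print(repr(emptied), kept)                                  # '' 24488
```

Note that the function returns a string; wrap the result in int() if a number is needed.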

Upvotes: 2
