Reputation:
I need some help writing a regex pattern that can find affiliate links on a webpage.
import re

import requests
from bs4 import BeautifulSoup

# Fetch the page and collect every <a> tag that has an href attribute
res = requests.get('https://www.example.com')
soup = BeautifulSoup(res.text, 'lxml')
links = soup.find_all('a', href=True)
# example_of_affiliate_links = ['http://example.com/click/click?p=1&t=url&s=IDHERE&url=https://www.mywebsite.com/920&f=TXL&name=electronic/ps4/','https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html']
I want to collect all affiliate links that point to "mywebsite.com" using the following regex pattern, but it is not capturing any links.
pattern = re.compile(r'([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)')
Is there a better way to do this?
Upvotes: 0
Views: 67
Reputation: 792
Here's the regex you're looking for:
https?://www.mywebsite.com\S*$
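A minimal sketch of how that pattern could be applied to the hrefs, assuming the goal is to keep every link whose URL ends with a mywebsite.com destination. The helper name find_affiliate_links and the third sample URL are made up for illustration, and the dots are escaped here as a small tightening that is not part of the pattern above.

```python
import re

# Dots are escaped so they match a literal "."; this tightening is an
# assumption added for the example, not part of the pattern as posted.
AFFILIATE_TARGET = re.compile(r'https?://www\.mywebsite\.com\S*$')


def find_affiliate_links(hrefs):
    """Return the hrefs that contain a mywebsite.com destination URL.

    Hypothetical helper: search() is used instead of match() because the
    destination URL sits in the middle of the affiliate link.
    """
    return [href for href in hrefs if AFFILIATE_TARGET.search(href)]


# The two affiliate links from the question plus one unrelated, made-up link.
hrefs = [
    'http://example.com/click/click?p=1&t=url&s=IDHERE&url=https://www.mywebsite.com/920&f=TXL&name=electronic/ps4/',
    'https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html',
    'https://example.org/about',
]

affiliate_links = find_affiliate_links(hrefs)
print(len(affiliate_links))
```

With the soup from the question, the input list could be built as [a['href'] for a in links] before passing it to the helper.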
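The problems with your original pattern:
([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)
[] means "any one of these characters", so with [http,https] you're looking for a single character, which might be "h", "t", "p", "s" or ",".
\S only captures one character; you need to add a quantifier after it, such as * in \S*.
The [\.html,\.php,\&] part has the same problem: it is a character class, so it matches a single character rather than a whole suffix like .html or .php.
Upvotes: 1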