user11322373

Python: Regex to find associated HTML links

I need some help writing a regex pattern which can find affiliated links from a webpage.

Example code:

import re

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.example.com')
soup = BeautifulSoup(res.text, 'lxml')
# Every <a> tag that has an href attribute
links = soup.find_all('a', href=True)

# example_of_affiliate_links = ['http://example.com/click/click?p=1&t=url&s=IDHERE&url=https://www.mywebsite.com/920&f=TXL&name=electronic/ps4/','https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html']

I want to collect all affiliated links for "mywebsite.com", using the following regex pattern, but it is not capturing any links.

pattern = re.compile(r'([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)')

Is there a better way to do this?

Upvotes: 0

Views: 67

Answers (1)

Zorzi

Reputation: 792

Here's the regex you're looking for:

https?://www\.mywebsite\.com\S*$
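As a minimal sketch of how this pattern could be applied, the snippet below runs it over the two sample affiliate links from the question plus one unrelated URL. The `hrefs` list and the helper name `find_affiliate_targets` are made up for illustration; in practice the hrefs would come from `soup.find_all('a', href=True)`. The dots are escaped here so that they match only a literal ".", and `search` is used rather than `match` because the target URL sits in the middle of the href.

```python
import re

# Sample hrefs: the first two are from the question, the third is an
# unrelated link added for illustration.
hrefs = [
    'http://example.com/click/click?p=1&t=url&s=IDHERE&url=https://www.mywebsite.com/920&f=TXL&name=electronic/ps4/',
    'https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html',
    'https://example.org/about',
]

# Escaped dots match a literal "." only; \S* takes the rest of the href.
pattern = re.compile(r'https?://www\.mywebsite\.com\S*$')


def find_affiliate_targets(hrefs):
    """Return the embedded mywebsite.com URL from each href that has one."""
    targets = []
    for href in hrefs:
        # search() scans the whole string; match() would only try position 0.
        found = pattern.search(href)
        if found:
            targets.append(found.group(0))
    return targets


targets = find_affiliate_targets(hrefs)
for target in targets:
    print(target)
```

With BeautifulSoup, the same helper could be fed `[a['href'] for a in links]`.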

What's wrong with your regex?

([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)
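  • The parentheses around the whole pattern are unnecessary: they only create a capture group that duplicates the full match
  • [] defines a character class, which matches any single one of the listed characters, so [http,https] matches exactly one character: "h", "t", "p", "s" or ","
  • \S matches only one character; you need to add a quantifier such as * after it
  • The same goes for the [\.html,\.php,\&] part: it matches a single character, not one of the listed extensions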
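The points above can be checked with a short sketch. The variable names are illustrative only, and the URL is the destination from the question's second sample link.

```python
import re

# A character class matches exactly one character from the set.
char_class = re.compile(r'[http,https]')
matches_single_letter = char_class.fullmatch('h') is not None
matches_comma = char_class.fullmatch(',') is not None
matches_whole_word = char_class.fullmatch('https') is not None

# "s?" makes the trailing "s" optional, covering both schemes.
scheme = re.compile(r'https?')
matches_http = scheme.fullmatch('http') is not None
matches_https = scheme.fullmatch('https') is not None

url = 'https://www.mywebsite.com/product/138/sony-ps4.html'

# Without a quantifier, \S consumes a single non-whitespace character.
one_char = re.search(r'\.com\S', url).group(0)

# With *, \S consumes the rest of the URL.
many_chars = re.search(r'\.com\S*', url).group(0)

print(matches_single_letter, matches_comma, matches_whole_word)
print(matches_http, matches_https)
print(one_char)
print(many_chars)
```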

Upvotes: 1
