pseudocode425

Reputation: 77

Regex help in Python to find links

I am parsing links from an HTML page and want to detect all links that match the following pattern:

http://www.example.com/category1/some-content-here/
http://www.example.com/category-12/some-content-here/

It should NOT match links below:

http://www.example.com/category1/
http://www.example.org/category-12/some-content-here/

Thanks!

Upvotes: 0

Views: 49

Answers (1)

Ajax1234

Reputation: 71451

You can use BeautifulSoup to collect the <a> tags from the HTML, and then use a regex to keep only the href values that match the pattern:

from bs4 import BeautifulSoup as soup
import re

sample = """
<div id='test'>
    <a href='http://www.example.com/category1/some-content-here/'>Something</a>
    <a href='http://www.example.com/category-12/some-content-here/'>Something Here</a>
    <a href='http://www.example.com/category1/'>Something1</a>
    <a href='http://www.example.org/category-12/some-content-here/'>Something else</a>
</div>
"""
# Raw string so that \w and \. reach the regex engine unchanged.
pattern = re.compile(r'http://[\w.]+\.com/[\w-]+/[\w-]+/')
# href=True skips <a> tags that have no href attribute (avoids a KeyError).
a = [i['href'] for i in soup(sample, 'lxml').find_all('a', href=True) if pattern.search(i['href'])]
print(a)

Output:

['http://www.example.com/category1/some-content-here/', 'http://www.example.com/category-12/some-content-here/']
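The filter above accepts any href that contains a match, on any .com host. If you need a stricter check, one possible variant is sketched below. It assumes the host is always www.example.com and that a valid link has exactly two path segments followed by a trailing slash, as in the question. The names CONTENT_LINK and is_content_link are made up for this illustration.

```python
import re

# Assumption: host fixed to www.example.com, exactly two path segments,
# trailing slash required (taken from the examples in the question).
CONTENT_LINK = re.compile(r"http://www\.example\.com/[\w-]+/[\w-]+/")


def is_content_link(url):
    """Return True only when the entire URL matches the pattern."""
    # fullmatch anchors at both ends, so extra prefixes or path segments fail.
    return CONTENT_LINK.fullmatch(url) is not None


urls = [
    "http://www.example.com/category1/some-content-here/",
    "http://www.example.com/category-12/some-content-here/",
    "http://www.example.com/category1/",
    "http://www.example.org/category-12/some-content-here/",
]
matches = [u for u in urls if is_content_link(u)]
print(matches)
```

This prints only the first two URLs. Because the check is a plain function on strings, it can replace the regex condition in the list comprehension without changing the BeautifulSoup part.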

Upvotes: 3
