Reputation: 77
I am parsing links from an HTML page and I want to detect all links that match the following pattern:
http://www.example.com/category1/some-content-here/
http://www.example.com/category-12/some-content-here/
It should NOT match the links below:
http://www.example.com/category1/
http://www.example.org/category-12/some-content-here/
Thanks!
Upvotes: 0
Views: 49
Reputation: 71451
You can use BeautifulSoup to parse the HTML <a> tags, and then use a regex to filter the full list of links:
from bs4 import BeautifulSoup as soup
import re
sample = """
<div id='test'>
<a href='http://www.example.com/category1/some-content-here/'>Someting</a>
<a href='http://www.example.com/category-12/some-content-here/'>Someting Here</a>
<a href='http://www.example.com/category1/'>Someting1</a>
<a href='http://www.example.org/category-12/some-content-here/'>Sometingelse</a>
</div>
"""
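# Raw string avoids invalid-escape warnings; fullmatch rejects URLs with extra path segments
pattern = re.compile(r'http://[\w.]+\.com/[\w-]+/[\w-]+/')
# href=True skips <a> tags that have no href attribute
a = [i['href'] for i in soup(sample, 'lxml').find_all('a', href=True) if pattern.fullmatch(i['href'])]
print(a)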
Output:
['http://www.example.com/category1/some-content-here/', 'http://www.example.com/category-12/some-content-here/']
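If you only want to check the pattern itself, here is a minimal sketch that runs it against the four URLs from the question without any HTML parsing. It assumes the host is always www.example.com and that a wanted link has exactly two path segments; the names LINK_PATTERN and is_wanted are made up for this example.

```python
import re

# Assumption: the host is fixed to www.example.com and a wanted link has
# exactly two path segments made of word characters and hyphens.
LINK_PATTERN = re.compile(r"http://www\.example\.com/[\w-]+/[\w-]+/")


def is_wanted(url: str) -> bool:
    """Return True only when the entire URL matches the pattern."""
    return LINK_PATTERN.fullmatch(url) is not None


urls = [
    "http://www.example.com/category1/some-content-here/",
    "http://www.example.com/category-12/some-content-here/",
    "http://www.example.com/category1/",
    "http://www.example.org/category-12/some-content-here/",
]

# Keep only the URLs that pass the check.
matches = [u for u in urls if is_wanted(u)]
print(matches)
```

Because fullmatch has to consume the whole string, a URL with only one path segment, or with a third one, is rejected. If other hosts should be accepted, the host part of the pattern would need to be loosened.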
Upvotes: 3