Reputation: 1766
I am using Python and need a regex to extract the Contacts link from a web page. I wrote <a (.*?)>(.*?)Contacts(.*?)</a>
and the result is:
href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li><a href="/ru/photogallery.html" id="menu645" title="Photo">Photo</a></li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
but I need only the last <a ...> tag,
like this:
href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
What regex pattern should I use?
Python code:
import re

# body holds the HTML source of the page
match = re.findall('<a (.*?)>(.*?)Contacts(.*?)</a>', body)
if match:
    for m in match:
        print(''.join(m))
Upvotes: 1
Views: 177
Reputation: 1679
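Try BeautifulSoup:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2
import urllib2

links = []
# Placeholder URLs; urlopen needs full URLs including the scheme (e.g. http://)
urls = ['www.u1.com', 'www.u2.om']  # ....
for url in urls:
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a'):
        # link.string is None when the tag contains nested markup
        if link.string and link.string.strip().lower() == 'contacts':
            links.append(link.get('href'))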
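If installing BeautifulSoup is not an option, the same idea (walk every <a> tag and compare its text) can probably be sketched with the standard library's html.parser. This assumes Python 3, and LinkTextFinder is just a hypothetical helper name for illustration, not part of any library:

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect href values of <a> tags whose text equals a target string."""

    def __init__(self, target):
        super().__init__()
        self.target = target.lower()
        self.matches = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Compare the collected link text case-insensitively
            if ''.join(self._text).strip().lower() == self.target:
                self.matches.append(self._href)
            self._href = None


# Sample HTML taken from the question
html = ('<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li>'
        '<li><a href="/ru/photogallery.html" id="menu645" title="Photo">Photo</a></li>'
        '<li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts</a></li>')

finder = LinkTextFinder('Contacts')
finder.feed(html)
print(finder.matches)
```

Links without an href attribute are skipped, and the comparison is on the whole link text, so "Contacts us" would not match.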
Upvotes: 0
Reputation: 19801
Since you are parsing HTML, I would suggest using BeautifulSoup:
# sample html from question
html = '<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li><a href="/ru/photogallery.html" id="menu645" title="Photo">Photo</a></li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts</a></li>'
from bs4 import BeautifulSoup
doc = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
aTag = doc.find('a', id='menu583') # id for Contacts link
print(aTag['href'])
# '/ru/kontakt.html'
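If a regex is still wanted for a small, predictable snippet like this one, the over-match in the question comes from (.*?) being free to cross tag boundaries: the match starts at the first <a and lazily grows until it finally reaches Contacts. A minimal sketch of one way around that, assuming attribute values never contain > and the link text contains no nested tags, is to use negated character classes instead:

```python
import re

# Sample HTML taken from the question
html = ('<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li>'
        '<li><a href="/ru/photogallery.html" id="menu645" title="Photo">Photo</a></li>'
        '<li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts</a></li>')

# [^>]* cannot leave the opening tag and [^<]* cannot enter another tag,
# so the match can no longer start at an earlier <a
pattern = re.compile(r'<a ([^>]*)>([^<]*Contacts[^<]*)</a>')
matches = pattern.findall(html)

for attrs, text in matches:
    print(attrs, text)

# Pull the href out of the captured attribute string
href = re.search(r'href="([^"]*)"', matches[0][0]).group(1)
print(href)
```

This is brittle on real-world pages (nested markup inside the link, single-quoted attributes, and so on), which is why a parser is the safer choice.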
Upvotes: 3