Reputation: 83
Hope you are all well. I'm new to Python and am using Python 2.7.
I'm trying to extract only the mailto links from this public business directory: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The addresses I'm looking for are the emails shown in every widget, from A to Z, in the full directory. Unfortunately, this directory does not have an API.
I'm using BeautifulSoup, but with no success so far.
Here is my code:
import urllib
from bs4 import BeautifulSoup
website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://'+ website).read()
soup = BeautifulSoup(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
What I get is just the site's own links, such as http://www.tecomdirectory.com, along with other hrefs, rather than the mailto links or websites in the widgets. I also tried replacing soup('a') with soup('target'), but no luck. Can anybody help me, please?
Upvotes: 2
Views: 4727
Reputation: 180441
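You cannot just find every anchor; you need to look specifically for "mailto:" in the href. You can use the CSS selector a[href^="mailto:"], which finds anchor tags whose href starts with mailto: (quoting the value keeps the selector valid in newer BeautifulSoup releases, which can reject an unquoted colon):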
import requests
from bs4 import BeautifulSoup

url = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print([a["href"] for a in soup.select('a[href^="mailto:"]')])
Or extract the text:
print([a.text for a in soup.select('a[href^="mailto:"]')])
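The href values collected this way still carry the mailto: prefix, and may include a ?subject=... suffix or several recipients. Here is a minimal sketch of one way to reduce them to bare addresses. The helper name mailto_to_addresses and the sample hrefs are made up for illustration, not taken from the directory:

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2.7


def mailto_to_addresses(href):
    """Return the bare addresses in a mailto: href, or [] if it is not one."""
    prefix = "mailto:"
    if not href or not href.lower().startswith(prefix):
        return []
    # Drop the scheme and anything after "?" (subject=, cc=, ...).
    recipients = href[len(prefix):].split("?", 1)[0]
    # A mailto link may list several comma-separated recipients.
    return [unquote(part).strip() for part in recipients.split(",") if part.strip()]


# Hypothetical hrefs, standing in for what the selector might return.
hrefs = [
    "mailto:info@example.com?subject=Hello",
    "mailto:a@example.com,b%40example.com",
    "http://www.example.com",
]
emails = [addr for h in hrefs for addr in mailto_to_addresses(h)]
print(emails)
```

The try/except import is only there so the same snippet runs on both Python 2.7 and Python 3.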
Using find_all("a"), you would need a regex to achieve the same:
import re
soup.find_all("a", href=re.compile(r"^mailto:"))
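To see what that pattern accepts without hitting the site, here is a small sketch that applies it to a list of made-up href strings. It assumes, as I understand it, that BeautifulSoup tests a compiled pattern with search(), so the ^ anchor is what restricts matches to the start of the value. The re.I variant is an optional addition for links written as MAILTO:.

```python
import re

# Hypothetical href values, standing in for what tag.get("href") might return.
hrefs = [
    "mailto:info@example.com",
    "http://www.example.com",
    "MAILTO:sales@example.com",
    "/contact?via=mailto:",
    None,
]

strict = re.compile(r"^mailto:")         # the pattern from the answer
relaxed = re.compile(r"^mailto:", re.I)  # also accepts upper-case schemes


def matching(pattern, values):
    # Skip anchors with no href at all, then keep values the pattern matches.
    return [v for v in values if v is not None and pattern.search(v)]


strict_hits = matching(strict, hrefs)
relaxed_hits = matching(relaxed, hrefs)
print(strict_hits)
print(relaxed_hits)
```

Only the relaxed pattern picks up the upper-case link, and neither matches a URL that merely contains "mailto:" somewhere in the middle.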
Upvotes: 3