PIMg021

Reputation: 83

Python 2.7 BeautifulSoup: email scraping

Hope you are all well. I'm new to Python and am using Python 2.7.

I'm trying to extract only the mailto links from this public business directory: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The addresses I'm looking for are the emails shown in every widget, from A to Z, in the full directory. Unfortunately, the directory does not have an API. I'm using BeautifulSoup, but with no success so far.
Here is my code:

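import urllib
from bs4 import BeautifulSoup

website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://' + website).read()
# Name the parser explicitly so results do not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')

# soup('a') is shorthand for soup.find_all('a'): every anchor on the page.
tags = soup('a')

for tag in tags:
    print tag.get('href', None)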

What I get is only the site's own links, such as http://www.tecomdirectory.com, along with other hrefs, rather than the mailto links or the websites in the widgets. I also tried replacing soup('a') with soup('target'), but no luck. Can anybody help me, please?

Upvotes: 2

Views: 4727

Answers (1)

Padraic Cunningham

Reputation: 180441

You cannot just find every anchor; you need to look specifically for "mailto:" in the href. You can use the CSS selector a[href^="mailto:"], which finds anchor tags whose href starts with mailto:. Quote the value, because newer BeautifulSoup versions reject an unquoted attribute value that contains a colon:

import requests
from bs4 import BeautifulSoup

url = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# href of every anchor whose href starts with "mailto:"
print([a["href"] for a in soup.select('a[href^="mailto:"]')])

Or extract the text:

print([a.text for a in soup.select('a[href^="mailto:"]')])
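
The href values collected this way still carry the mailto: prefix, and a mailto link can also include a query such as ?subject=... or percent-escaped characters. As a minimal sketch of turning such an href into a bare address, something like the following could be used. The helper name mailto_to_address and the sample hrefs are hypothetical and are not taken from the directory; only the standard library is used.

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2.7


def mailto_to_address(href):
    """Return the address part of a mailto: href, or None if it is not one."""
    prefix = "mailto:"
    if not href.lower().startswith(prefix):
        return None
    # Drop the scheme, then any "?subject=..." style query string.
    address = href[len(prefix):].split("?", 1)[0]
    # Decode percent-escapes such as %40 and trim stray whitespace.
    return unquote(address).strip()


# Hypothetical sample hrefs, for illustration only.
samples = [
    "mailto:info@example.com",
    "mailto:sales%40example.com?subject=Hello",
    "http://www.example.com",
]
addresses = [mailto_to_address(h) for h in samples]
print(addresses)
```

Applied to the scraped list, the helper would be called on each a["href"] value; entries that return None can be filtered out.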

With find_all("a") you would need a regex to achieve the same result:

import re

soup.find_all("a", href=re.compile(r"^mailto:"))
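
To check that the two approaches agree without hitting the live site, here is a small self-contained sketch that runs both against an inline HTML snippet. The markup and addresses are made up to stand in for one directory widget (the real page structure is an assumption), and the built-in "html.parser" is used so no extra parser is needed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for one directory widget.
html = """
<div class="widget">
  <a href="http://www.example.com">Example LLC</a>
  <a href="mailto:info@example.com">info@example.com</a>
  <a href="mailto:sales@example.org">Contact sales</a>
  <a name="top">anchor without an href</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: href starts with "mailto:" (value quoted).
by_css = [a["href"] for a in soup.select('a[href^="mailto:"]')]

# find_all with a compiled regex matched against the href attribute.
by_regex = [a["href"] for a in soup.find_all("a", href=re.compile(r"^mailto:"))]

print(by_css)
print(by_regex)
```

Both lists should contain only the two mailto links; the plain website link and the anchor without an href are skipped in either case.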

Upvotes: 3
