NK20

Reputation: 75

BeautifulSoup - How to extract email from a website?

I'm trying to extract some information from a website, but I don't know how to scrape the email address.

This code works for me:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup

url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
soup = BeautifulSoup(page_html,"lxml")

members = soup.findAll("b")
for member in members:
    member = members[0].text
print(member)

I wanted to extract the number and the link with soup.findAll() but couldn't find a way to get the text properly, so I used the SelectorGadget tool and tried this:

numbers = soup.select("#content li:nth-child(1)")
for number in numbers:
    number = numbers[0].text
print(number)

links = soup.select(".icon-globe+ a")
for link in links:
    link = links[0].text
print(link)

It prints correctly:

2 L'Eau Protection
 (+33) 02 98 19 43 86
http://www.2leau-protection.com/
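
(A side note I picked up along the way: findAll matches tag names and attributes, while CSS selector strings only work with select. A quick self-contained check on a made-up snippet:)

```python
from bs4 import BeautifulSoup

# Made-up snippet mimicking the page's icon + link markup
html = '<i class="icon-globe"></i> <a href="http://example.com">site</a>'
soup = BeautifulSoup(html, "html.parser")

# findAll treats its first argument as a tag *name*, so a CSS selector
# string silently matches nothing; select is the method that speaks CSS.
print(soup.findAll(".icon-globe+ a"))        # []
print(soup.select(".icon-globe + a")[0].text)  # site
```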

Now, when it comes to extracting the email address, I'm stuck. I'm new to this; any advice would be appreciated, thank you!

Attempt 1

emails = soup.select("#content li:nth-child(2)")
for email in emails:
    email = emails[0].text
print(email)

I don't even know what this is; it just prints:

//<![CDATA[
var l=new Array();
l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
for (var i = l.length-1; i >= 0; i=i-1){
if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
else document.write(unescape(l[i]));}
//]]>

Attempt 2

emails = soup.select(".icon-mail~ a") #follow the same logic
for email in emails:
    email = emails[0].text
print(email)

Error

NameError: name 'email' is not defined

Attempt 3

emails = soup.select(".icon-mail~ a")
print(emails)

Print empty

[]

Attempt 4,5,6

email = soup.find("a",{"href":"mailto:"}) # Prints None

email = soup.findAll("a",{"href":"mailto:"}) # Prints an empty list []

email = soup.select("a",{"href":"mailto:"}) # Prints a lot of information, but not what I need.
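
(For the record, the reason Attempts 4 and 5 print nothing is that {"href":"mailto:"} only matches an href exactly equal to "mailto:". On a page where the link actually exists in the HTML, which turns out not to be the case here, a prefix match would be needed. A sketch on a made-up snippet:)

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet; on the real page the mailto link is injected by
# JavaScript, so no selector tweaking can find it in the raw HTML.
html = '<a href="mailto:someone@example.com">mail</a>'
soup = BeautifulSoup(html, "html.parser")

# {"href": "mailto:"} matches only an href EQUAL to "mailto:";
# a compiled regex matches the prefix instead.
link = soup.find("a", href=re.compile(r"^mailto:"))
print(link["href"][7:])  # someone@example.com
```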

Upvotes: 4

Views: 13375

Answers (7)

Nishant Jalan

Reputation: 1038

If you want to find the email address, you can use a regex. Import the re module, search the text, and collect the matches in a list.

import re
..
text = soup.get_text()
# [gmail|yahoo|rediff] is a character class, not alternation, and the
# dot needs escaping; use a non-capturing group instead.
emails = re.findall(r'[a-z0-9.]+@(?:gmail|yahoo|rediff)\.com', text)
for email in emails:
    print(email)

Let me know the result. Happy coding!
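
A quick offline check of the regex approach (the sample text and addresses below are made up):

```python
import re

text = "Contact: info@2leau-protection.com or support@gmail.com"
# Note: [gmail|yahoo] would be a character class; alternation needs a
# group like (?:gmail|yahoo), and the literal dot should be escaped.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['info@2leau-protection.com', 'support@gmail.com']
```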

Upvotes: 2

samarth mishra

Reputation: 11

import re

text = soup.get_text()  # get_text() already returns a str
emails = re.findall(r'[a-z0-9.]+@\S+\.com', text)
print(emails)

This is a much more convenient way to print emails from a website.

Upvotes: 1

Daniyal Ahmed

Reputation: 1

I found this method more accurate:

import re
from requests import get

url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"
text = get(url).content
emails = re.findall(r'[a-z0-9.]+@\S+\.com', str(text))  # .content is bytes, hence str()

Upvotes: -1

Paul Becotte

Reputation: 9997

I see that you already have perfectly acceptable answers, but when I saw that obfuscation script I was fascinated, and just had to "de-obfuscate" it.

from bs4 import BeautifulSoup
from requests import get
import re

page = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"

content = get(page).content
soup = BeautifulSoup(content, "lxml")

exp = re.compile(r"(?:.*?='(.*?)')")
# Find any element with the mail icon
for icon in soup.findAll("i", {"class": "icon-mail"}):
    # the 'a' element doesn't exist, there is a script tag instead
    script = icon.next_sibling
    # the script tag builds a long array of single characters - let's grab them
    chars = exp.findall(script.text)
    output = []
    # the javascript array is iterated backwards
    for char in reversed(list(chars)):
        # many characters use their ascii representation instead of simple text
        if char.startswith("|"):
            output.append(chr(int(char[1:])))
        else:
            output.append(char)
    # putting the array back together gets us an `a` element
    # putting the array back together gets us an `a` element
    link = BeautifulSoup("".join(output), "lxml")
    # the email is the part of the href after `mailto: `
    email = link.findAll("a")[0]["href"][8:]
    print(email)
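
The decoding loop itself can be exercised without touching the network; a minimal sketch on a made-up two-entry array in the same format:

```python
import re

# A tiny array in the same format the page's script uses: entries are
# written back-to-front, and "|NNN" entries are character codes.
script = "var l=new Array();l[0]='b';l[1]='|97';"
chars = re.compile(r"(?:.*?='(.*?)')").findall(script)
decoded = "".join(
    chr(int(c[1:])) if c.startswith("|") else c
    for c in reversed(chars)
)
print(decoded)  # ab
```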

Upvotes: 3

A urllib and BeautifulSoup combination may be insufficient in some cases, such as a webpage that loads and displays information through an API call or JavaScript. You are getting only the very first version of the page, before it loads anything dynamically. That's why you may need to emulate a real browser somehow. You could do it with raw JavaScript calls, but there is a more convenient way.

The Selenium library is used for automating web tasks and test automation. It can also be employed as a scraper. Since it drives a real browser engine (like Mozilla Gecko or the Google Chrome driver), it is more robust in most cases. Here is an example of how you can accomplish your task:

from selenium import webdriver

url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"


option = webdriver.ChromeOptions()
option.add_argument("--headless")
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)

browser.get(url)

print(browser.find_element_by_css_selector(".icon-mail~ a").text)

The output is:

[email protected]

Edit: You can obtain Selenium with pip install selenium, and you can find the Chrome driver here

Upvotes: 3

Mukul Kumar Jha

Reputation: 1082

The reason you cannot scrape that part of the website is that it is generated by JavaScript and is not present in the initial HTML. This can be checked with the following code snippet:

    import requests
    from lxml import html

    url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"
    page = requests.get(url).text
    tree = html.fromstring(page)
    print(html.tostring(tree, pretty_print=True).decode())

which gives you the complete HTML document, but let us focus on the div containing the user's profile:

    <div class="col-sm-12 col-md-6">
       <ul class="iconlist">
          <li>
             <i class="icon-phone"> </i>(+33) 02 98 19 43 86</li>

          <li>
              <i class="icon-mail"> </i><script type="text/javascript">
                //<![CDATA[
                var l=new Array();
    l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
    for (var i = l.length-1; i >= 0; i=i-1){
    if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
    else document.write(unescape(l[i]));}
    //]]>
              </script>
           </li>
           <li>
            <i class="icon-globe"></i> <a href="http://www.2leau-protection.com/" target="_blank"><i style="background-color:#2C3E50"></i>http://www.2leau-protection.com/</a>
          </li>
        </ul>
     </div>

Look carefully: this is the same script you scraped in your Attempt 1 when you were trying to extract the email.

Upvotes: 3

Roger Lindsjö

Reputation: 11553

BeautifulSoup only handles the HTML of the page; it does not execute any JavaScript. The email address is generated with JavaScript as the document loads (probably to make that information harder to scrape).

In this case it is generated by:

<script type="text/javascript">
    //<![CDATA[
    var l=new Array();
    l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
    for (var i = l.length-1; i >= 0; i=i-1){
    if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
    else document.write(unescape(l[i]));}
    //]]>
</script>

Upvotes: 2
