Reputation: 75
I'm trying to extract some information from a website, but I don't know how to scrape the email address.
This code works for me:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup
url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
soup = BeautifulSoup(page_html,"lxml")
members = soup.findAll("b")
for member in members:
    print(member.text)
I wanted to extract the phone number and the link with soup.findAll() but couldn't find a way to get the text properly, so I used the SelectorGadget tool and tried this:
numbers = soup.select("#content li:nth-child(1)")
for number in numbers:
    print(number.text)

links = soup.select(".icon-globe+ a")
for link in links:
    print(link.text)
It prints correctly:
2 L'Eau Protection
(+33) 02 98 19 43 86
http://www.2leau-protection.com/
Now, when it comes to extracting the email address, I'm stuck. I'm new to this; any advice would be appreciated, thank you!
Attempt 1
emails = soup.select("#content li:nth-child(2)")
for email in emails:
    email = emails[0].text
    print(email)
I don't even understand what it prints:
//<![CDATA[
var l=new Array();
l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
for (var i = l.length-1; i >= 0; i=i-1){
if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
else document.write(unescape(l[i]));}
//]]>
Attempt 2
emails = soup.select(".icon-mail~ a") # following the same logic
for email in emails:
    email = emails[0].text
print(email)
Error
NameError: name 'email' is not defined
Attempt 3
emails = soup.select(".icon-mail~ a")
print(emails)
It prints an empty list:
[]
Attempts 4, 5, 6
email = soup.find("a", {"href": "mailto:"})    # prints "None"
email = soup.findAll("a", {"href": "mailto:"}) # prints an empty list "[]"
email = soup.select("a", {"href": "mailto:"})  # prints a lot of information, but not what I need
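For reference, find("a", {"href": "mailto:"}) only matches an href that is exactly the string "mailto:", not one that merely starts with it; a prefix match needs a regex. A minimal sketch of the difference on a small hypothetical inline document (the name info@example.com is made up for illustration):

```python
import re
from bs4 import BeautifulSoup

html = '<a href="mailto:info@example.com">info@example.com</a>'  # hypothetical sample
soup = BeautifulSoup(html, "html.parser")

# Exact-match lookup fails: the href is not literally "mailto:"
print(soup.find("a", {"href": "mailto:"}))  # None

# A compiled regex matches any href that starts with "mailto:"
tag = soup.find("a", href=re.compile(r"^mailto:"))
print(tag["href"][7:])  # strip the 7-character "mailto:" prefix
```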
Upvotes: 4
Views: 13375
Reputation: 1038
If you want to find the email address, you can use a regex. Import the re module, search the page text, and collect the matches in a list.
import re
..
text = soup.get_text()
emails = re.findall(r'[a-z0-9]+@(?:gmail|yahoo|rediff)\.com', text)
for email in emails:
    print(email)
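As a caveat, the address on this particular page is not on gmail, yahoo, or rediff, so a pattern restricted to those domains will miss it; a more general pattern may be needed. A sketch on hypothetical sample text (the addresses below are made up):

```python
import re

# Hypothetical sample text standing in for soup.get_text()
text = "Contact: info@2leau-protection.com or sales@example.org"

# Generic pattern: local part, "@", domain labels, then a TLD of 2+ letters
emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
print(emails)  # ['info@2leau-protection.com', 'sales@example.org']
```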
Let me know the result. Happy coding!
Upvotes: 2
Reputation: 11
import re
text = soup.get_text()
emails = re.findall(r'[a-z0-9]+@\S+\.com', text)
print(emails)
This is a much more convenient way to print emails from a website.
Upvotes: 1
Reputation: 1
I found this method more accurate...
import re
from requests import get

text = get(url).content
emails = re.findall(r'[a-z0-9]+@\S+\.com', str(text))
Upvotes: -1
Reputation: 9997
I see that you already have perfectly acceptable answers, but when I saw that obfuscation script I was fascinated, and just had to "de-obfuscate" it.
from bs4 import BeautifulSoup
from requests import get
import re

page = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"
content = get(page).content
soup = BeautifulSoup(content, "lxml")

exp = re.compile(r"(?:.*?='(.*?)')")

# Find any element with the mail icon
for icon in soup.findAll("i", {"class": "icon-mail"}):
    # the 'a' element doesn't exist; there is a script tag instead
    script = icon.next_sibling
    # the script tag builds a long array of single characters - let's grab them
    chars = exp.findall(script.text)
    output = []
    # the javascript array is iterated backwards
    for char in reversed(chars):
        # many characters use their ascii representation instead of plain text
        if char.startswith("|"):
            output.append(chr(int(char[1:])))
        else:
            output.append(char)
    # putting the array back together gets us an 'a' element
    link = BeautifulSoup("".join(output), "lxml")
    # the email is the part of the href after "mailto: "
    email = link.findAll("a")[0]["href"][8:]
    print(email)
Upvotes: 3
Reputation: 105
A combination of urllib and BeautifulSoup may be insufficient when a page loads and displays information through API calls or JavaScript: you get only the very first version of the page, before it loads anything dynamically. That is why you may need to emulate a real browser. You could execute the JavaScript calls yourself, but there is a more convenient way.
The Selenium library is used for automating web tasks and test automation, and it can also be used as a scraper. Since it drives a real browser engine (like Mozilla's Gecko or the Google Chrome driver), it is more robust in most cases. Here is an example of how you can accomplish your task:
from selenium import webdriver
url = "https://www.eurocham-cambodia.org/member/476/2-LEau-Protection"
option = webdriver.ChromeOptions()
option.add_argument("--headless")
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
browser.get(url)
print(browser.find_element_by_css_selector(".icon-mail~ a").text)
The output is:
[email protected]
Edit: You can install Selenium with pip install selenium
and you can find chrome driver from here
Upvotes: 3
Reputation: 1082
The reason you cannot scrape that part of the website is that it is generated by JavaScript and is not present in the initial HTML. This can be checked with the following code snippet:
import requests
from lxml import html

page = requests.get("https://www.eurocham-cambodia.org/member/476/2-LEau-Protection").text
tree = html.fromstring(page)
print(html.tostring(tree, pretty_print=True).decode())
which gives you the complete HTML document, but let us focus on the div containing the user's profile.
<div class="col-sm-12 col-md-6">
<ul class="iconlist">
<li>
<i class="icon-phone"> </i>(+33) 02 98 19 43 86</li>
<li>
<i class="icon-mail"> </i><script type="text/javascript">
//<![CDATA[
var l=new Array();
l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
for (var i = l.length-1; i >= 0; i=i-1){
if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
else document.write(unescape(l[i]));}
//]]>
</script>
</li>
<li>
<i class="icon-globe"></i> <a href="http://www.2leau-protection.com/" target="_blank"><i style="background-color:#2C3E50"></i>http://www.2leau-protection.com/</a>
</li>
</ul>
</div>
Look carefully: this is the same script you got above in Attempt 1 when you were trying to scrape the email.
Upvotes: 3
Reputation: 11553
BeautifulSoup only handles the HTML of the page; it does not execute any JavaScript. The email address is generated with JavaScript as the document is loaded (probably to make it harder to scrape that information).
In this case it is generated by:
<script type="text/javascript">
//<![CDATA[
var l=new Array();
l[0]='>';l[1]='a';l[2]='/';l[3]='<';l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|110';l[9]='|111';l[10]='|105';l[11]='|116';l[12]='|99';l[13]='|101';l[14]='|116';l[15]='|111';l[16]='|114';l[17]='|112';l[18]='|45';l[19]='|117';l[20]='|97';l[21]='|101';l[22]='|108';l[23]='|50';l[24]='|64';l[25]='|110';l[26]='|111';l[27]='|105';l[28]='|116';l[29]='|97';l[30]='|109';l[31]='|114';l[32]='|111';l[33]='|102';l[34]='|110';l[35]='|105';l[36]='|32';l[37]='>';l[38]='"';l[39]='|109';l[40]='|111';l[41]='|99';l[42]='|46';l[43]='|110';l[44]='|111';l[45]='|105';l[46]='|116';l[47]='|99';l[48]='|101';l[49]='|116';l[50]='|111';l[51]='|114';l[52]='|112';l[53]='|45';l[54]='|117';l[55]='|97';l[56]='|101';l[57]='|108';l[58]='|50';l[59]='|64';l[60]='|110';l[61]='|111';l[62]='|105';l[63]='|116';l[64]='|97';l[65]='|109';l[66]='|114';l[67]='|111';l[68]='|102';l[69]='|110';l[70]='|105';l[71]='|32';l[72]=':';l[73]='o';l[74]='t';l[75]='l';l[76]='i';l[77]='a';l[78]='m';l[79]='"';l[80]='=';l[81]='f';l[82]='e';l[83]='r';l[84]='h';l[85]=' ';l[86]='a';l[87]='<';
for (var i = l.length-1; i >= 0; i=i-1){
if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
else document.write(unescape(l[i]));}
//]]>
</script>
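The script's logic can be reproduced in Python: it walks the array backwards, treating entries that start with "|" as character codes and everything else as literal text. A minimal sketch using a shortened, hypothetical array (the real one on the page has 88 entries):

```python
# Hypothetical sample standing in for the page's var l=new Array(...)
l = ["|109", "|111", "|99", "|46", "x", "@", "y"]

output = []
# The JavaScript loop iterates the array backwards
for item in reversed(l):
    if item.startswith("|"):
        # "&#NNN;" in the JS becomes chr(NNN) in Python
        output.append(chr(int(item[1:])))
    else:
        output.append(item)
print("".join(output))  # y@x.com
```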
Upvotes: 2