Reputation: 22440
I've written a Python script to scrape an email address from a webpage, but I can't get it to work. The email address sits inside a script tag, and I can't get past that to fetch the content. Any help would be much appreciated.
Here is what I've tried so far:
import requests
from bs4 import BeautifulSoup
url = "replace_with_link_above"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
for items in soup.select(".profile-right-info"):
    email = items.select_one("dd a[href^='mailto:']")['href']
    print(email)
Upon execution I get the below error:
email = items.select_one("dd a[href^='mailto:']")['href']
TypeError: 'NoneType' object is not subscriptable
By the way, the email link is in the second row under the "Profile details" title on that webpage.
Upvotes: 2
Views: 189
Reputation: 1269
You should check out the Network tab of the Chrome dev tools:
There is a block of code:
<script language='JavaScript' type='text/javascript'>
<!--
var prefix = 'mailto:';
var suffix = '';
var attribs = '';
var path = 'hr' + 'ef' + '=';
var addy99716 = "Robz" + '@';
addy99716 = addy99716 + 'allinthepolish' + '.' + 'com';
document.write( '<a ' + path + '"' + prefix + addy99716 + suffix + '"' + attribs + '>' );
document.write( addy99716 );
document.write( '<\/a>' );
//-->
</script>
which evaluates to an <a> tag with an href attribute equal to:
mailto:Robz@allinthepolish.com
In the page source the address is written with HTML entities, so you get the plain text above once you decode them. You can check the decoding here: https://mothereff.in/html-entities
So, one option would be using something like Selenium as cgte proposed.
The other option is to get the contents of the <dd> tag, parse the JS code, and then either run it with the node executable (which could be dangerous unless you run it in a sandbox) or evaluate it manually. The Selenium option seems a lot simpler.
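For the "evaluate manually" route, here is a minimal sketch in Python. It assumes the script keeps the simple shape shown above: only string literals and previously assigned variables joined with +. The function name evaluate_assignments and the two regexes are my own, not part of any library, and the script text is pasted in as a string rather than fetched, so treat it as a starting point rather than a general JS evaluator.

```python
import html
import re

# Script body copied from the page (pasted here instead of being fetched).
SCRIPT = r"""
var prefix = 'mailto:';
var suffix = '';
var attribs = '';
var path = 'hr' + 'ef' + '=';
var addy99716 = "Robz" + '@';
addy99716 = addy99716 + 'allinthepolish' + '.' + 'com';
document.write( '<a ' + path + '"' + prefix + addy99716 + suffix + '"' + attribs + '>' );
document.write( addy99716 );
document.write( '<\/a>' );
"""

# Matches `var name = expr;` and `name = expr;` on a single line.
ASSIGNMENT = re.compile(r"^\s*(?:var\s+)?(\w+)\s*=\s*(.+?);\s*$", re.MULTILINE)
# Matches a single-quoted literal, a double-quoted literal, or an identifier.
TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|(\w+)""")


def evaluate_assignments(js_source):
    """Evaluate simple string-concatenation assignments (hypothetical helper).

    Assumption: every right-hand side is a chain of string literals and
    already-assigned variables joined by `+`. Anything else is not handled.
    """
    variables = {}
    for name, expression in ASSIGNMENT.findall(js_source):
        parts = []
        for single, double, identifier in TOKEN.findall(expression):
            if identifier:
                # Reference to a variable assigned earlier in the script.
                parts.append(variables.get(identifier, ""))
            else:
                parts.append(single or double)
        # Decode HTML entities such as &#64; in case the source uses them.
        variables[name] = html.unescape("".join(parts))
    return variables


values = evaluate_assignments(SCRIPT)
# Pick the variable whose name starts with "addy" (the numeric suffix may vary).
address = next(v for k, v in values.items() if k.startswith("addy"))
print(values["prefix"] + address)
```

In a real run the script text would come from the <script> element inside the <dd>, for example through BeautifulSoup. No JavaScript is executed, so there is no sandboxing concern, but the approach breaks as soon as the page changes how it builds the address.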
Upvotes: 2