robots.txt

Reputation: 137

Unable to scrape an email address from a webpage using the requests module

I'm trying to scrape an email address from this webpage using the requests module, not Selenium. The email address is obfuscated and does not appear in the page source; a JavaScript function generates it when the page renders. How can I use the following snippet to recover the email address shown on that webpage?

document.write("\u003cn uers=\"znvygb:[email protected]\"\[email protected]\u003c/n\u003e".replace(/[a-zA-Z]/g, function(c){return String.fromCharCode((c<="Z"?90:122)>=(c=c.charCodeAt(0)+13)?c:c-26);}));

Here is what I've tried so far:

import requests
from bs4 import BeautifulSoup

link = 'https://www.californiatoplawyers.com/lawyer/311805/tobyn-yael-aaron'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
email = soup.select_one("dt:-soup-contains('Email') + dd")
print(email)

Expected output:

[email protected]

Upvotes: 0

Views: 52

Answers (1)

Andrej Kesely

Reputation: 195528

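For tasks like this I recommend the js2py module, which can evaluate the JavaScript snippet from Python. Removing the document.write call leaves an expression that evaluates to the decoded HTML string: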

import js2py
import requests
from bs4 import BeautifulSoup

link = "https://www.californiatoplawyers.com/lawyer/311805/tobyn-yael-aaron"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
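# Locate the <dd> that follows the "Email" <dt>; it contains the obfuscating <script>
email_tag = soup.select_one("dt:-soup-contains('Email') + dd")

# Drop the document.write wrapper so the expression evaluates to the decoded HTML
js_code = email_tag.script.contents[0].replace("document.write", "")
# Run the replace() call in js2py, then strip the <a> markup to leave the address
email = BeautifulSoup(js2py.eval_js(js_code), "html.parser").text
print(email)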

Prints:

[email protected]
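
The replace() callback in that script is a plain ROT13 substitution, so as an alternative you can probably skip the JavaScript engine and decode the string in Python directly. The following is only a sketch under some assumptions: SAMPLE_JS is a made-up snippet in the same shape as the page's script (the address in it is hypothetical), decode_rot13_email and LITERAL_RE are names I introduced, the string literal is assumed to be valid JSON as well as valid JavaScript, and tags are stripped with a simple regex rather than a real HTML parser.

```python
import codecs
import json
import re

# Hypothetical sample shaped like the page's script; the address is invented.
SAMPLE_JS = (
    r'document.write("\u003cn uers=\"znvygb:fbzrbar@rknzcyr.pbz\"\u003e'
    r'fbzrbar@rknzcyr.pbz\u003c/n\u003e".replace(/[a-zA-Z]/g, '
    r'function(c){return String.fromCharCode('
    r'(c<="Z"?90:122)>=(c=c.charCodeAt(0)+13)?c:c-26);}));'
)

# First double-quoted string literal passed to document.write, escapes included.
LITERAL_RE = re.compile(r'document\.write\(\s*("(?:[^"\\]|\\.)*")')


def decode_rot13_email(js_code: str) -> str:
    """Decode a ROT13-obfuscated document.write snippet without running JS."""
    match = LITERAL_RE.search(js_code)
    if match is None:
        raise ValueError("no document.write string literal found")
    # Resolve \u003c and \" first (assumes the literal is JSON-compatible);
    # ROT13 must run afterwards, as it does in the browser.
    obfuscated_html = json.loads(match.group(1))
    html = codecs.decode(obfuscated_html, "rot13")
    # Crude tag stripping; adequate for a single <a> element like this one.
    return re.sub(r"<[^>]+>", "", html).strip()


print(decode_rot13_email(SAMPLE_JS))
```

With the real page you would pass email.script.contents[0] (before the document.write replacement) instead of SAMPLE_JS. This only works while the site keeps using ROT13; js2py is the more general option if the obfuscation changes.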

Upvotes: 2
