Reputation: 447
I'm new to scraping and parsing, and I don't know how to approach the following problem. I need to scrape email addresses from many pages. For example, this is the part of the page source that contains the email:
<tr><td>Email:</td><td width="10"></td><td><script>var ylhrfq = "ypr";var bdnd = "ail";var byil = "st.c";var bwdbdf = "age@";var dqiex = ".c";var pner = "om";var qkfow = "gm";var azzl = "ie";var hgcr = "n.pl";var link = byil + ylhrfq + azzl + hgcr + bwdbdf + qkfow + bdnd + dqiex + pner;var text = link;document.write('<a href="mailto:'+link+'" />'+text+'</a>');</script></td></tr>
Is it possible to grab this email with BeautifulSoup? If so, how can I do it?
Win7, Python3, BeautifulSoup
Upvotes: 1
Views: 139
Reputation: 985
It seems the email address is hidden in the original HTML and generated by JavaScript code. With Python 2, requests, js2py, and BeautifulSoup4, I finally got the correct email address; hopefully this is what you wanted.
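import bs4
import requests
import js2py
# Python 2 import; on Python 3 use `from html import unescape` instead.
from HTMLParser import HTMLParser

# Fetch the raw page source (the email is not in it as plain text).
html = requests.get('http://findyourvacationhome.com/find.php?property=5068927').content
soup = bs4.BeautifulSoup(html, 'html.parser')
# Locate the <script> in the email cell; these indices depend on this page's layout.
raw_script = soup.find_all('table')[6].find_all('tr')[2].find_all('td')[2].script.contents[0]
# Drop the document.write(...) part so the script ends with the `link` assignment.
script = raw_script.replace("""var text = link;document.write('<a href="mailto:'+link+'" />'+text+'</a>');""", "")
# Run the remaining JavaScript; the result is the assembled address.
result = js2py.eval_js(script)
# Decode any HTML entities in the result.
htmlparser = HTMLParser()
result = htmlparser.unescape(result)
print(result)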
I did it in 4 steps:
1. requests: fetch the HTML of the page.
2. BeautifulSoup4: parse the HTML and extract the JavaScript code that generates the email.
3. js2py: execute the JavaScript code and get the result.
4. HTMLParser: unescape any HTML entities in the result.
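Since the question mentions Python 3, here is a minimal sketch of an alternative to steps 3 and 4 that uses only the standard library. It does not run any JavaScript; it assumes every obfuscated script follows the same pattern as the one in the question (string literals assigned with `var`, then joined with `+` into a variable named `link`). The `decode_email` helper and the `SCRIPT` constant are hypothetical names for illustration, and the script text is copied from the question rather than fetched from the site.

```python
import re
from html import unescape

# Script body copied from the question's HTML (normally obtained via BeautifulSoup).
SCRIPT = ('var ylhrfq = "ypr";var bdnd = "ail";var byil = "st.c";'
          'var bwdbdf = "age@";var dqiex = ".c";var pner = "om";'
          'var qkfow = "gm";var azzl = "ie";var hgcr = "n.pl";'
          'var link = byil + ylhrfq + azzl + hgcr + bwdbdf + qkfow + bdnd + dqiex + pner;'
          "var text = link;document.write('<a href=\"mailto:'+link+'\" />'+text+'</a>');")


def decode_email(script):
    """Rebuild the address from `var x = "..."` pieces (assumed script pattern)."""
    # Map each variable name to its string literal.
    parts = dict(re.findall(r'var\s+(\w+)\s*=\s*"([^"]*)"\s*;', script))
    # Find the concatenation assigned to `link`, e.g. `var link = a + b + c;`.
    match = re.search(r'var\s+link\s*=\s*([\w\s+]+);', script)
    if match is None:
        return None  # the script does not follow the assumed pattern
    names = [name.strip() for name in match.group(1).split('+')]
    # Join the pieces in order and decode any HTML entities.
    return unescape(''.join(parts[name] for name in names))


print(decode_email(SCRIPT))  # -> st.cyprien.plage@gmail.com
```

This avoids the js2py dependency, but it will fail on pages whose scripts are structured differently, in which case executing the JavaScript as in the answer above is the more robust route.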
Upvotes: 1
Reputation: 54891
You need to get the parsed html. The source itself only contains placeholders and scripts. In PowerShell I would run this to get the email:
$t = Invoke-WebRequest -Uri "http://findyourvacationhome.com/find.php?property=5068927"
$t.Links | Where-Object { $_.href -match 'mailto' } | Select-Object -ExpandProperty outertext
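For readers working in Python, the same filtering step (keep only links whose href is a mailto) can be sketched with the standard library. This assumes you already have the rendered HTML, meaning the markup after the page's scripts have run; obtaining it requires a tool that executes JavaScript and is not shown here. The `MailtoCollector` class and the sample markup with `user@example.com` are hypothetical, for illustration only.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect the addresses of all <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <a ... />.
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if href.lower().startswith('mailto:'):
            self.emails.append(href[len('mailto:'):])


# Hypothetical rendered markup, shaped like the output of the page's document.write call.
rendered = '<td><a href="mailto:user@example.com" />user@example.com</a></td>'

collector = MailtoCollector()
collector.feed(rendered)
print(collector.emails)
```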
Upvotes: 0