GiveItAwayNow

Reputation: 447

Can't get email while parsing

I'm new to scraping and parsing, and I'm stuck on the following problem. I need to scrape the email address from many pages. For example, on one page the part of the source that contains the email is:

<tr><td>Email:</td><td width="10"></td><td><script>var ylhrfq = "&#121;&#112;&#114;";var bdnd = "&#97;&#105;&#108;";var byil = "&#115;&#116;&#46;&#99;";var bwdbdf = "&#97;&#103;&#101;&#64;";var dqiex = "&#46;&#99;";var pner = "&#111;&#109;";var qkfow = "&#103;&#109;";var azzl = "&#105;&#101;";var hgcr = "&#110;&#46;&#112;&#108;";var link = byil + ylhrfq + azzl + hgcr + bwdbdf + qkfow + bdnd + dqiex + pner;var text = link;document.write('<a href="mailto:'+link+'"  />'+text+'</a>');</script></td></tr>

Is it possible to grab this email with BeautifulSoup? If so, how can I do it?

Win7, Python3, BeautifulSoup

Upvotes: 1

Views: 139

Answers (2)

realhu

Reputation: 985

It seems the email address is hidden in the original HTML and generated by JavaScript code. With Python 2, requests, js2py and BeautifulSoup4, I got the correct email address; hopefully this is what you wanted.

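import bs4
import requests
import js2py
# Python 2 import; on Python 3 use html.unescape from the standard html module instead
from HTMLParser import HTMLParser

# 1. Download the raw page source
html = requests.get('http://findyourvacationhome.com/find.php?property=5068927').content
soup = bs4.BeautifulSoup(html, 'html.parser')
# 2. Locate the <script> that builds the email (the indexes are specific to this page layout)
raw_script = soup.find_all('table')[6].find_all('tr')[2].find_all('td')[2].script.contents[0]

# Drop the document.write call so that the script ends with the `link` assignment
script = raw_script.replace("""var text = link;document.write('<a href="mailto:'+link+'"  />'+text+'</a>');""", """""")
# 3. Run the remaining JavaScript; the result is still HTML-entity encoded
result = js2py.eval_js(script)
# 4. Turn entities such as &#64; into plain characters
htmlparser = HTMLParser()
result = htmlparser.unescape(result)

print(result)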

I did it in 4 steps:

  1. Get the HTML of the web page with requests.
  2. Use BeautifulSoup4 to parse the HTML and extract the JavaScript code that generates the email.
  3. Use js2py to execute the JavaScript code and get the result.
  4. Unescape the HTML entities in the result with HTMLParser.
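Since the question mentions Python 3, steps 3 and 4 could also be sketched without a JavaScript engine. This is only a rough alternative, not the approach used above: it assumes the script always has the same shape as in the question (quoted `var` fragments followed by a `var link = a + b + ...;` line). The function name `decode_obfuscated_email` and the sample script (which decodes to a made-up address) are hypothetical.

```python
import html
import re

# Hypothetical sample with the same shape as the script in the question.
SAMPLE_SCRIPT = (
    'var qc = "&#97;&#109;&#112;&#108;&#101;";'
    'var qa = "&#117;&#115;&#101;&#114;";'
    'var qd = "&#46;&#99;&#111;&#109;";'
    'var qb = "&#64;&#101;&#120;";'
    'var link = qa + qb + qc + qd;'
    'var text = link;'
)


def decode_obfuscated_email(script):
    """Rebuild the address from the quoted fragments and the `link` expression."""
    # Every `var name = "literal"` assignment becomes one dictionary entry.
    fragments = dict(re.findall(r'var\s+(\w+)\s*=\s*"([^"]*)"', script))
    # The `var link = a + b + ...;` line gives the order of the fragments.
    match = re.search(r'var\s+link\s*=\s*([^;]+);', script)
    if match is None:
        raise ValueError("no 'link' assignment found in script")
    names = [name.strip() for name in match.group(1).split('+')]
    joined = ''.join(fragments[name] for name in names)
    # Turn numeric entities such as &#64; into plain characters.
    return html.unescape(joined)


print(decode_obfuscated_email(SAMPLE_SCRIPT))
```

The string passed in would be the `raw_script` extracted in step 2. If the site changes how it builds the address, the regular expressions stop matching, whereas running the script with js2py would still work.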

Upvotes: 1

Frode F.

Reputation: 54891

You need to get the parsed HTML. The source itself only contains placeholders and scripts. In PowerShell I would run this to get the email:

$t = Invoke-WebRequest -Uri "http://findyourvacationhome.com/find.php?property=5068927"
$t.Links | Where-Object { $_.href -match 'mailto' } | Select-Object -ExpandProperty outertext
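For readers working in Python, the filtering step of that pipeline (keep only links whose `href` is a `mailto:`) might look roughly like the sketch below. It assumes you already have the rendered HTML as a string, that is, the markup after the page's scripts have run. How to obtain it is outside this sketch, and the names `MailtoCollector` and `mailto_addresses` are hypothetical.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collects the address part of every <a href="mailto:..."> tag."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <a ... />.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            self.addresses.append(href[len("mailto:"):])


def mailto_addresses(rendered_html):
    """Return all mailto addresses found in already-rendered HTML."""
    collector = MailtoCollector()
    collector.feed(rendered_html)
    return collector.addresses


# Made-up markup in the form the page's document.write call would produce.
rendered = '<td><a href="mailto:user@example.com"  />user@example.com</a></td>'
print(mailto_addresses(rendered))
```

Running this on the raw page source instead of the rendered HTML would find nothing, because the link only exists after the script has written it.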

Upvotes: 0
