Reputation: 331
I'm using PyQt4's QtWebKit to render a webpage in memory, because I need its JavaScript executed in order to retrieve an embedded Flash video element. Currently, the code I'm using looks like this:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import QWebSettings, QWebPage

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        # Settings
        s = self.settings()
        s.setAttribute(QWebSettings.AutoLoadImages, False)
        s.setAttribute(QWebSettings.JavascriptCanOpenWindows, False)
        s.setAttribute(QWebSettings.PluginsEnabled, True)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

def get_page_source(url):
    r = Render(url)
    html = r.frame.toHtml()
    return html
This works, although it is extremely slow to initialize (taking anywhere from 5 to 30 seconds to start). However, it only works for a single page. On the first webpage, my final output looks like this:
<div>
<embed type="application/x-shockwave-flash" src="/player.swf" width="560" height="440" style="undefined" id="mediaplayer" name="mediaplayer" quality="high" allowfullscreen="true" wmode="opaque" flashvars="width=560&height=440&autostart=true&fullscreen=true&file=FILELINK"></embed>
</div>
But on successive attempts, it looks like this:
<div>
<font>
<u>
<b>
<a href="http://get.adobe.com/flashplayer/">ATTENTION:<br>This video will not play. You currently do not have Adobe Flash installed on this computer. Please click here to download it (it's free!)
</a>
</b>
</u>
</font>
</div>
What is happening here that I'm not aware of?
Upvotes: 1
Reputation: 56654
It looks like your JavaScript interpreter only kicks in on the first page: the second page loads but never gets its JavaScript run. That is irrelevant to your real problem, though, which is that the name of the video file is hidden in a chunk of code that looks like this:
<script type="text/javascript">
var googleCode = 'czEuYWRkVmFyaWFibGUoImZpbGUiLCJodHRwOi8vd2lsbGlhbS5yaWtlci53aW1wLmNvbS9sb2FkdmlkZW8vMDA5YzUwMzNkZmYyMDQ3MmJiYzBjMjk2NmJjNzI2MjIvNGZmNGQ2ZDYvd2ViLXZpZGVvcy9iZTVjYWI2YjcxNmU0OWExZjFiYzc3NGNlMjVlZDg0Yl93YWtlci5mbHYiKTs=';
eval(lxUTILsign.decode(googleCode));
</script>
If you open a JavaScript console and run lxUTILsign.decode(googleCode);
you get
"s1.addVariable(\"file\",\"http://worf.wimp.com/loadvideo/2e368b70f8577ad167087530fc73748d/4ff4f5df/web-videos/35e78d1932b24f80ae3a9210fce008c4_titanic.flv\");"
The bad news is that lxUTILsign is thoroughly obfuscated; the good news is, that's irrelevant, because it is simply a base64 decoder, and Python already has one (batteries included, baby!).
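As a quick check of the base64 claim, here is a minimal sketch that decodes the googleCode string quoted in the script block above using only the standard library. The variable names are my own, and it is written so that it should run on both Python 2 and Python 3.

```python
import base64

# The googleCode value, copied verbatim from the <script> block quoted above.
google_code = 'czEuYWRkVmFyaWFibGUoImZpbGUiLCJodHRwOi8vd2lsbGlhbS5yaWtlci53aW1wLmNvbS9sb2FkdmlkZW8vMDA5YzUwMzNkZmYyMDQ3MmJiYzBjMjk2NmJjNzI2MjIvNGZmNGQ2ZDYvd2ViLXZpZGVvcy9iZTVjYWI2YjcxNmU0OWExZjFiYzc3NGNlMjVlZDg0Yl93YWtlci5mbHYiKTs='

# b64decode returns bytes on Python 3, so convert to text before printing.
decoded = base64.b64decode(google_code).decode('ascii')
print(decoded)
```

The output is an s1.addVariable("file", ...) call of the same shape as the console result shown above, carrying whichever URL was encoded in this particular snippet. The function below applies the same decoding to a live page.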
import base64
import re
import urllib2

def get_video_url(page_url):
    # Fetch the raw page; no JavaScript engine is needed
    html = urllib2.urlopen(page_url).read()
    # The video URL is hidden in a base64-encoded string assigned to googleCode
    match = re.search("googleCode = '(.*?)'", html)
    if match is None:
        raise ValueError('googleCode not found')
    googleString = base64.b64decode(match.group(1))
    # The decoded text is s1.addVariable("file","<url>"); capture the second argument
    match = re.search('","(.*?)"', googleString)
    if match is None:
        raise ValueError("didn't find video url")
    return match.group(1)

url = 'http://www.wimp.com/titanicpiano/'
print get_video_url(url)
prints
http://worf.wimp.com/loadvideo/8656607f77689f759d54b4ec7207152d/4ff4ff9c/web-videos/35e78d1932b24f80ae3a9210fce008c4_titanic.flv
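The function above needs network access and a page that still has the same layout, so it is hard to test in isolation. One option, sketched here under my own assumptions, is to split the parsing into a pure function that takes the HTML as a string. The name extract_video_url and the sample page are hypothetical, and example.com is only a placeholder host. The block is written to run on both Python 2 and Python 3.

```python
import base64
import re


def extract_video_url(html):
    """Return the video URL hidden in a page's base64-encoded googleCode string."""
    match = re.search(r"googleCode = '(.*?)'", html)
    if match is None:
        raise ValueError('googleCode not found')
    # b64decode returns bytes on Python 3, so convert to text before searching.
    decoded = base64.b64decode(match.group(1)).decode('utf-8')
    # The decoded text looks like s1.addVariable("file","<url>");
    match = re.search(r'","(.*?)"', decoded)
    if match is None:
        raise ValueError("didn't find video url")
    return match.group(1)


# Made-up page for illustration only; it mimics the structure quoted above.
payload = 's1.addVariable("file","http://example.com/video.flv");'
encoded = base64.b64encode(payload.encode('utf-8')).decode('ascii')
sample_html = "<script>var googleCode = '%s';</script>" % encoded

print(extract_video_url(sample_html))  # prints http://example.com/video.flv
```

With this split, the download step (urllib2.urlopen in the answer's code) stays a one-liner, and the regex logic can be checked against saved pages without touching the network.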
Upvotes: 1