Reputation: 47
I was trying to scrape http://www.popsci.com/thorium-dream for learning purposes.
I tried to get the video src but was unable to, because the video tag is injected by JavaScript.
I looked at the XHR requests in the browser's network panel and saw the media file request for the video:
General
Remote Address:68.232.45.253:80
Request URL:http://video.net2.tv/PORTICO/TECH/POPSCI/POP_84/POP_20140718_84_Thorium_A/POP_20140718_84_Thorium_A_1200.mp4
Request Method:GET
Status Code:206 Partial Content (from cache)
Response Headers
Accept-Ranges:bytes
Cache-Control:max-age=604800
Content-Length:24833827
Content-Range:bytes 0-24833826/24833827
Content-Type:video/mp4
Date:Mon, 14 Sep 2015 02:54:29 GMT
Etag:"734657553"
Expires:Mon, 21 Sep 2015 02:54:29 GMT
Last-Modified:Fri, 18 Jul 2014 21:56:46 GMT
Server:ECAcc (cpm/F8B9)
X-Cache:HIT
Request Headers
Provisional headers are shown
Accept-Encoding:identity;q=1, *;q=0
Range:bytes=0-
Referer:http://player.net2.tv/?episode=53c9973ae7dbcc820502c81c&restart=true&snipe=true
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36
How can I get this URL by scraping? If possible, please give a solution that uses only the default Python libraries.
Upvotes: 3
Views: 3069
Reputation: 4021
I've coded something for you. It extracts all the video URLs from POPSCI episode pages:
import re
import requests
from lxml import html


def getVideosLinks(content):
    # Collect every .mp3/.mp4 URL found in the playlist response text
    videos = re.findall(r'(http://[\.\w/_]+\.mp[34])', content)
    return videos


def prepareJSONurl(episode_hash):
    # Playlist endpoint for a given episode
    json_url = "http://pepto.portico.net2.tv/playlist/{hash}".format(hash=episode_hash)
    return json_url


def extractEpisodeHash(content):
    # The player page redirects through a <meta http-equiv="refresh"> tag
    tree = html.fromstring(content)
    video_url = tree.xpath('//meta[contains(@http-equiv, "refresh")]/@content')[0].split('=', 1)[1]
    episode_hash = re.findall(r'episode=([\w]+)', video_url)
    return episode_hash[0]


def extractIframeURL(content):
    # The article page embeds the player in an <iframe>; no iframe means no video
    iframe_url = None
    tree = html.fromstring(content)
    try:
        iframe_url = tree.xpath('//iframe/@src')[0]
        is_video = True
    except IndexError:
        is_video = False
    return is_video, iframe_url


POPSCI_URL = "http://www.popsci.com/thorium-dream"

response = requests.get(POPSCI_URL)
is_video, iframe_url = extractIframeURL(response.content)

if is_video:
    response_from_iframe_url = requests.get(iframe_url)
    episode_hash = extractEpisodeHash(response_from_iframe_url.content)
    json_url = prepareJSONurl(episode_hash)
    final_response = requests.get(json_url)
    # .text (str) instead of .content (bytes) so the str regex also works on Python 3
    for video in getVideosLinks(final_response.text):
        print("Video: {}".format(video))
else:
    print("This is not a POPSCI video page :|")
Each episode is available in several qualities and sizes, so you will see more than one .mp4 video URL per episode.
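If only one rendition is wanted, the list can be filtered by the number at the end of each file name. This is a minimal sketch, assuming the trailing number (such as the `1200` in the `..._1200.mp4` request URL from the question) reflects bitrate or quality. That is a guess from the URL pattern, not something the site documents. `pick_highest_quality` is a hypothetical helper, and the `_600` sample URL is made up for illustration:

```python
import re


def pick_highest_quality(video_urls):
    """Return the URL whose file name ends in the largest number, or None."""
    best_url, best_rate = None, -1
    for url in video_urls:
        # Assumption: file names look like ..._<number>.mp4 (or .mp3)
        match = re.search(r'_(\d+)\.mp[34]$', url)
        if match and int(match.group(1)) > best_rate:
            best_url, best_rate = url, int(match.group(1))
    return best_url


base = ("http://video.net2.tv/PORTICO/TECH/POPSCI/POP_84/"
        "POP_20140718_84_Thorium_A/POP_20140718_84_Thorium_A")
sample = [
    base + "_600.mp4",   # hypothetical lower-quality rendition
    base + "_1200.mp4",  # URL seen in the question's network log
]
print(pick_highest_quality(sample))
```

With the code above, the call would be `pick_highest_quality(getVideosLinks(...))` on the playlist response text.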
This code works for any POPSCI episode page. Try changing POPSCI_URL to...
POPSCI_URL = "http://www.popsci.com/maker-faire-2015"
... and it will still work.
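Because the playlist is fetched from `json_url`, the response could also be parsed as JSON instead of being scanned with a regular expression, which avoids missing URLs whose slashes are escaped as `\/`. This is a Python 3 sketch under the assumption that the body is valid JSON. Its actual structure is not shown in this thread, so the code walks every nested value rather than relying on specific keys. `find_media_urls` and the sample payload are hypothetical:

```python
import json


def find_media_urls(node):
    """Recursively collect string values ending in .mp4 or .mp3."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(find_media_urls(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_media_urls(item))
    elif isinstance(node, str) and node.lower().endswith(('.mp4', '.mp3')):
        found.append(node)
    return found


# Made-up payload for illustration only; the real playlist layout may differ.
sample_body = ('{"items": [{"src": "http://example.com/a_600.mp4"}, '
               '{"src": "http:\\/\\/example.com\\/a_1200.mp4"}], "title": "demo"}')
print(find_media_urls(json.loads(sample_body)))
```

With requests, the equivalent call would be `find_media_urls(final_response.json())`.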
Even though parsing HTML with regular expressions (regexp) is not recommended, I have created a regexp-only version for you (as requested). It works, but the regular expressions could be improved:
import re
import requests


def getVideosLinks(content):
    # Collect every .mp3/.mp4 URL found in the playlist response text
    videos = re.findall(r'(http://[\.\w/_]+\.mp[34])', content)
    return videos


def prepareJSONurl(episode_hash):
    # Playlist endpoint for a given episode
    json_url = "http://pepto.portico.net2.tv/playlist/{hash}".format(hash=episode_hash)
    return json_url


def extractEpisodeHash(content):
    # "/?" tolerates an optional slash before the query string
    episode_hash = re.findall(r'<meta http-equiv="refresh" content="0; url=http://player\.net2\.tv/?\?episode=([\w]+)&restart', content)[0]
    return episode_hash


def extractIframeURL(content):
    iframe_url = None
    try:
        # [^"]* stops at the closing quote instead of matching greedily
        iframe_url = re.findall(r'<iframe src="([^"]*)" style', content)[0]
        is_video = True
    except IndexError:
        is_video = False
    return is_video, iframe_url


POPSCI_URL = "http://www.popsci.com/thorium-dream"

response = requests.get(POPSCI_URL)
# .text (str) instead of .content (bytes) so the str regexes also work on Python 3
is_video, iframe_url = extractIframeURL(response.text)

if is_video:
    response_from_iframe_url = requests.get(iframe_url)
    episode_hash = extractEpisodeHash(response_from_iframe_url.text)
    json_url = prepareJSONurl(episode_hash)
    final_response = requests.get(json_url)
    for video in getVideosLinks(final_response.text):
        print("Video: {}".format(video))
else:
    print("This is not a POPSCI video page :|")
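The question asked for a solution using only the default Python libraries, while both versions above still depend on requests. Below is a minimal Python 3 sketch of the same chain (page, iframe, meta refresh, playlist) using only the standard library. The names `FrameAndRefreshParser`, `parse_page`, `episode_from_url`, `fetch` and `find_video_urls` are hypothetical. The sample meta tag is modeled on the regex above and the Referer header in the question, not copied from the live page, and the sketch assumes the iframe src is an absolute URL:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs
from urllib.request import urlopen


class FrameAndRefreshParser(HTMLParser):
    """Record the first iframe src and the first meta-refresh target URL."""

    def __init__(self):
        super().__init__()
        self.iframe_src = None
        self.refresh_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'iframe' and self.iframe_src is None:
            self.iframe_src = attrs.get('src')
        elif tag == 'meta' and (attrs.get('http-equiv') or '').lower() == 'refresh':
            # content looks like "0; url=http://..."; keep what follows the first "="
            content = attrs.get('content') or ''
            if '=' in content and self.refresh_url is None:
                self.refresh_url = content.split('=', 1)[1].strip()


def parse_page(html_text):
    parser = FrameAndRefreshParser()
    parser.feed(html_text)
    return parser


def episode_from_url(url):
    """Return the 'episode' query parameter of url, or None."""
    values = parse_qs(urlparse(url).query).get('episode')
    return values[0] if values else None


def fetch(url):
    # Assumes UTF-8 pages; undecodable bytes are replaced
    with urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='replace')


def find_video_urls(page_url, fetch=fetch):
    """Follow page -> iframe -> meta refresh -> playlist and return media URLs."""
    iframe_src = parse_page(fetch(page_url)).iframe_src
    if not iframe_src:
        return []
    refresh_url = parse_page(fetch(iframe_src)).refresh_url
    episode = episode_from_url(refresh_url) if refresh_url else None
    if not episode:
        return []
    playlist = fetch("http://pepto.portico.net2.tv/playlist/" + episode)
    return re.findall(r'http://[.\w/]+\.mp[34]', playlist)


# Offline check of the parsing steps with an assumed player page
sample_player_page = (
    '<html><head><meta http-equiv="refresh" content="0; '
    'url=http://player.net2.tv/?episode=53c9973ae7dbcc820502c81c&restart=true">'
    '</head></html>')
print(episode_from_url(parse_page(sample_player_page).refresh_url))
```

Calling `find_video_urls("http://www.popsci.com/thorium-dream")` would run the whole chain over the network. The `fetch` parameter is there so the download step can be swapped out, for example when testing against saved pages.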
Hope this helps
Upvotes: 2