rrindam

Reputation: 47

How to scrape the video src URL from a video tag that is injected via JavaScript?

I am trying to scrape http://www.popsci.com/thorium-dream for learning purposes.

I tried to get the video src but could not, because the video tag is injected by JavaScript.

Looking at the network requests in the browser's developer tools, I found the media file request for the video:

General
Remote Address:68.232.45.253:80
Request URL:http://video.net2.tv/PORTICO/TECH/POPSCI/POP_84/POP_20140718_84_Thorium_A/POP_20140718_84_Thorium_A_1200.mp4
Request Method:GET
Status Code:206 Partial Content (from cache)
Response Headers
Accept-Ranges:bytes
Cache-Control:max-age=604800
Content-Length:24833827
Content-Range:bytes 0-24833826/24833827
Content-Type:video/mp4
Date:Mon, 14 Sep 2015 02:54:29 GMT
Etag:"734657553"
Expires:Mon, 21 Sep 2015 02:54:29 GMT
Last-Modified:Fri, 18 Jul 2014 21:56:46 GMT
Server:ECAcc (cpm/F8B9)
X-Cache:HIT
Request Headers
Provisional headers are shown
Accept-Encoding:identity;q=1, *;q=0
Range:bytes=0-
Referer:http://player.net2.tv/?episode=53c9973ae7dbcc820502c81c&restart=true&snipe=true
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36

How can I get this URL by scraping? If possible, please suggest a solution that uses only Python's standard library.

Upvotes: 3

Views: 3069

Answers (1)

I've coded something for you. It extracts all the video URLs from a POPSCI episode page:

import re
import requests
from lxml import html

def getVideosLinks(content):
    # Collect every http:// URL ending in .mp3 or .mp4 from the playlist response.
    videos = re.findall(r'(http://[\.\w/_]+\.mp[34])', content)
    return videos

def prepareJSONurl(episode_hash):
    # Build the playlist URL for a given episode hash.
    json_url = "http://pepto.portico.net2.tv/playlist/{hash}".format(hash=episode_hash)
    return json_url

def extractEpisodeHash(content):
    # The iframe page redirects via <meta http-equiv="refresh" content="0; url=...">;
    # the episode hash is the "episode" query parameter of that URL.
    tree = html.fromstring(content)
    video_url = tree.xpath('//meta[contains(@http-equiv, "refresh")]/@content')[0].split('=',1)[1]
    episode_hash = re.findall(r'episode=([\w]+)', video_url)
    return episode_hash[0]

def extractIframeURL(content):
    # Return (True, src) for the first iframe on the page, or (False, None) if there is none.
    iframe_url = None
    tree = html.fromstring(content)
    try:
        iframe_url = tree.xpath('//iframe/@src')[0]
        is_video = True
    except IndexError:
        is_video = False
    return is_video, iframe_url


POPSCI_URL = "http://www.popsci.com/thorium-dream"

response = requests.get(POPSCI_URL)
is_video, iframe_url = extractIframeURL(response.content)

if is_video:
    response_from_iframe_url = requests.get(iframe_url)
    episode_hash = extractEpisodeHash(response_from_iframe_url.content)

    json_url = prepareJSONurl(episode_hash)
    final_response = requests.get(json_url)

    # .text is a str, which the str regex in getVideosLinks needs on Python 3.
    for video in getVideosLinks(final_response.text):
        print("Video: {}".format(video))
else:
    print("This is not a POPSCI video page :|")

Each episode is offered in several video qualities and sizes, so you will see more than one .mp4 video URL per episode.
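If you only want one file per episode, you could sort the URLs by the number just before the extension. The URL in the question ends in _1200.mp4, so the sketch below assumes (and this is only a guess) that a larger trailing number means a higher-quality file. The names quality_of and best_quality are made up for illustration, and the _600 URL is a hypothetical second variant:

```python
import re

def quality_of(url):
    """Return the number right before the extension, or -1 if there is none."""
    match = re.search(r'_(\d+)\.mp[34]$', url)
    return int(match.group(1)) if match else -1

def best_quality(urls):
    """Return the URL with the largest trailing number, or None for an empty list."""
    return max(urls, key=quality_of, default=None)

base = "http://video.net2.tv/PORTICO/TECH/POPSCI/POP_84/POP_20140718_84_Thorium_A/POP_20140718_84_Thorium_A"
candidates = [base + "_600.mp4", base + "_1200.mp4"]  # the _600 variant is hypothetical
chosen = best_quality(candidates)
```

URLs without a trailing number get -1, so they are only picked when nothing else is available.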

The script should work for any POPSCI episode page. Try changing POPSCI_URL to...

POPSCI_URL = "http://www.popsci.com/maker-faire-2015"

... and it will still work.
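To try several pages in one run, the same three steps (page, iframe, playlist) can be wrapped in a function. This is a minimal sketch that uses only the standard library, as the question asked. The names find_videos and fetch are hypothetical, the regexes assume the page layout described in this answer, and the decoding assumes UTF-8:

```python
import re
from urllib.request import urlopen

PLAYLIST_URL = "http://pepto.portico.net2.tv/playlist/{hash}"

def fetch(url):
    """Download a URL and decode the body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def find_videos(page_url, fetch=fetch):
    """Return the media URLs for one page, or [] if a step finds nothing."""
    page = fetch(page_url)
    iframe = re.search(r'<iframe[^>]+src="([^"]+)"', page)
    if iframe is None:
        return []  # no player iframe on this page
    player_page = fetch(iframe.group(1))
    episode = re.search(r'episode=(\w+)', player_page)
    if episode is None:
        return []  # no episode hash in the iframe page
    playlist = fetch(PLAYLIST_URL.format(hash=episode.group(1)))
    return re.findall(r'http://[.\w/]+\.mp[34]', playlist)

# Example (performs real HTTP requests):
# for url in ("http://www.popsci.com/thorium-dream",
#             "http://www.popsci.com/maker-faire-2015"):
#     print(url, find_videos(url))
```

Passing fetch as a parameter lets you swap in a stub that returns saved pages, which is handy for testing without hitting the site.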

ADDED:

Even though parsing HTML with regular expressions is not recommended, I have created a regex-only version for you (as requested). It works, but the regular expressions could be improved:

import re
import requests

def getVideosLinks(content):
    # Collect every http:// URL ending in .mp3 or .mp4 from the playlist response.
    videos = re.findall(r'(http://[\.\w/_]+\.mp[34])', content)
    return videos

def prepareJSONurl(episode_hash):
    json_url = "http://pepto.portico.net2.tv/playlist/{hash}".format(hash=episode_hash)
    return json_url

def extractEpisodeHash(content):
    # Pull the episode hash out of the meta-refresh redirect on the iframe page.
    # The slash before "?" is optional: the Referer in the question includes one.
    episode_hash = re.findall(r'<meta http-equiv="refresh" content="0; url=http://player\.net2\.tv/?\?episode=([\w]+)&restart', content)[0]
    return episode_hash

def extractIframeURL(content):
    iframe_url = None
    try:
        # [^"]* stops at the closing quote of src instead of matching greedily.
        iframe_url = re.findall(r'<iframe src="([^"]*)" style', content)[0]
        is_video = True
    except IndexError:
        is_video = False
    return is_video, iframe_url


POPSCI_URL = "http://www.popsci.com/thorium-dream"

# .text is a str, which the str regexes above need on Python 3.
response = requests.get(POPSCI_URL)
is_video, iframe_url = extractIframeURL(response.text)

if is_video:
    response_from_iframe_url = requests.get(iframe_url)
    episode_hash = extractEpisodeHash(response_from_iframe_url.text)

    json_url = prepareJSONurl(episode_hash)
    final_response = requests.get(json_url)

    for video in getVideosLinks(final_response.text):
        print("Video: {}".format(video))
else:
    print("This is not a POPSCI video page :|")
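If you want to avoid both lxml and HTML regexes, the standard library's html.parser can find the same two tags. The sketch below is one possible approach, not a drop-in replacement: PlayerTagParser and episode_hash_from are hypothetical names, and the meta-refresh format ("0; url=...") is assumed from the regex above.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class PlayerTagParser(HTMLParser):
    """Collect the first iframe src and the first meta-refresh target URL."""

    def __init__(self):
        super().__init__()
        self.iframe_src = None
        self.refresh_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe" and self.iframe_src is None:
            self.iframe_src = attrs.get("src")
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # content looks like "0; url=http://...": keep what follows the first "="
            content = attrs.get("content") or ""
            _, _, target = content.partition("=")
            if self.refresh_url is None and target:
                self.refresh_url = target.strip()

def episode_hash_from(html_text):
    """Return the "episode" query parameter of the meta-refresh URL, or None."""
    parser = PlayerTagParser()
    parser.feed(html_text)
    if parser.refresh_url is None:
        return None
    query = parse_qs(urlparse(parser.refresh_url).query)
    return query.get("episode", [None])[0]
```

Feed the POPSCI page to PlayerTagParser to read iframe_src, then pass the iframe page's HTML to episode_hash_from. Reading the hash with parse_qs avoids depending on the order of the query parameters.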

Hope this helps

Upvotes: 2
