HelloThereToad

Reputation: 279

How to get specific data using BeautifulSoup

I'm not sure how to get a specific result from this:

<div class="videoPlayer">
    <div class="border-radius-player">
        <div id="allplayers" style="position:relative;width:100%;height:100%;overflow: hidden;">
            <div id="box">
                <div id="player_content" class="todo" style="text-align: center; display: block;">
                     <div id="player" class="jwplayer jw-reset jw-skin-seven jw-state-paused jw-flag-user-inactive" tabindex="0">
                         <div class="jw-media jw-reset">
                              <video class="jw-video jw-reset" x-webkit-playsinline="" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
                         </div>

How would I get the src attribute of <video class="jw-video jw-reset" x-webkit-playsinline="" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>?

This is what I've tried so far:

import urllib.request
from bs4 import BeautifulSoup

url = "https://someurlhere"

a = urllib.request.Request(url, headers={'User-Agent' : "Cliqz"})
b = urllib.request.urlopen(a)  # custom User-Agent prevents "Permission denied"

soup = BeautifulSoup(b, 'html.parser')

for video_class in soup.select("div.videoPlayer"):
    print(video_class.text)

This prints some of the text inside the div, but nothing from as deep as the video tag.

Upvotes: 0

Views: 288

Answers (1)

Simas Joneliunas

Reputation: 3118

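urllib.request (like Requests) is a plain HTTP client; it cannot execute JavaScript. The <video> tag is most likely inserted by the player script after the page loads, so it never appears in the HTML you download and BeautifulSoup has nothing to select.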
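
A quick way to check whether the tag is in the downloaded HTML at all is a small lookup like the one below. This is only a sketch: it runs on a trimmed static copy of the markup from the question, and the find_video_src helper name is made up for illustration. If it returns None for the real response body, the element is probably added by JavaScript.

```python
from bs4 import BeautifulSoup

# Trimmed static copy of the markup from the question; on the real page
# this would be the response body read from urlopen().
html = """
<div class="videoPlayer">
  <div class="jw-media jw-reset">
    <video class="jw-video jw-reset" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
  </div>
</div>
"""

def find_video_src(markup):
    """Return the src of the first <video> inside div.videoPlayer, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    video = soup.select_one("div.videoPlayer video.jw-video")
    # .get() avoids a KeyError when the tag exists but has no src attribute
    return video.get("src") if video is not None else None

print(find_video_src(html))
print(find_video_src('<div class="videoPlayer"></div>'))
```

Note that the attempt in the question printed .text, which only returns text nodes; attribute values such as src have to be read from the tag itself.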

You still have three options to try, though:

  1. Go over the HTML source (b) and see whether any of the scripts on the page contain the data you need. Usually the page holds the URL (which, I assume, is what you want to scrape) in some sort of holder, such as inline JavaScript code or a JSON object, that you can extract it from.
  2. Look at the XHR requests the site makes and see whether any of them query an external source for the video data. If so, see if you can imitate that request to get the data you need.
  3. (last resort) Use a Selenium-driven browser such as PhantomJS to render the page before parsing it (Link1, Link2). You can find out more about how to use Selenium in this SO post: https://stackoverflow.com/a/26440563/3986395
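
Option 1 could look roughly like the sketch below. Everything specific in it is an assumption: the sample page source, the example.com URL, the regex for video file extensions, and the video_urls_in_scripts helper name are all illustrative, and the real page may store the URL in a different form.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page source: the player is configured from an inline script,
# so the video URL appears only as text inside a <script> tag.
html = """
<div class="videoPlayer"><div id="player"></div></div>
<script src="/static/player.js"></script>
<script>
  setupPlayer("player", {"file": "https://example.com/media/clip.mp4"});
</script>
"""

# Assumed pattern: absolute URLs ending in a common video extension.
VIDEO_URL = re.compile(r"https?://[^\s\"']+\.(?:mp4|m3u8|webm)")

def video_urls_in_scripts(markup):
    """Collect video-like URLs found in the text of inline <script> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    urls = []
    for script in soup.find_all("script"):
        # script.string is None for external scripts with no inline body
        urls.extend(VIDEO_URL.findall(script.string or ""))
    return urls

print(video_urls_in_scripts(html))
```

On the real site you would pass the downloaded source instead of the sample string and adjust the pattern to whatever the scripts actually contain.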

Upvotes: 1
