amkyp

Reputation: 117

Why does the Beautiful Soup select method return None?

I need to extract YouTube links, together with their names, from YouTube playlists. I used SelectorGadget (a Chrome extension) to find the CSS selector, but when I try to get anything about the link, BeautifulSoup returns None. I don't know where I'm going wrong.

Below is the code I wrote:

import sys
import requests
from bs4 import BeautifulSoup
import re

try:
    # checking url format
    url_pattern = re.compile(r"^(?:http|https|ftp):\/\/[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+\.[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+$")

    # playlist_url = input("Enter your YouTube playlist URL: ")
    # getting the input directly from the command line
    playlist_url = sys.argv[1]

    if not url_pattern.match(playlist_url):
        raise ValueError("Enter valid link")

    get_links_from_youtube_playlist(playlist_url)

except ValueError as value_error:
    print(value_error)

Then I pass the URL to another function:


def get_links_from_youtube_playlist(youtube_playlist_url):

    request_response = requests.get(youtube_playlist_url)

    # using "html.parser" lib
    # soup_object = BeautifulSoup(request_response.text, 'html.parser')
    # using "lxml" - Processing XML and HTML with Python
    soup_object = BeautifulSoup(request_response.text, 'lxml')

    # not working?!
    url_list = soup_object.select("#video-title")
    print(url_list)
    # this is not working too?!
    div_content = soup_object.find("div", attrs={"class" : "content"})
    print(div_content)

I run it with the command below:

python3 test.py https://www.youtube.com/playlist\?list\=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

My output is None when I print the result of either the select or the find method. Shouldn't it find something meaningful, since the id is present in the page?

SelectorGadget shows me #video-title only when I click on that section, and I could not even access the div. How should I extract each link and its name?

Upvotes: 0

Views: 1248

Answers (1)

Barmar

Reputation: 781848

YouTube checks the User-Agent header to determine what kind of page to return. If you send the user agent of a real browser, you'll get the response you expect. In that response, video-title is a class, not an ID, so change the selector from #video-title to .video-title.

import pprint
from bs4 import BeautifulSoup
import requests

pp = pprint.PrettyPrinter()

def get_links_from_youtube_playlist(youtube_playlist_url):

    request_response = requests.get(youtube_playlist_url, headers={"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"})

    soup_object = BeautifulSoup(request_response.text, 'lxml')
    url_list = soup_object.select(".video-title")
    pp.pprint(url_list)

get_links_from_youtube_playlist('https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab')

Output:

[<div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>]
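Note that the divs in this output are empty, so they carry no link or title text to extract. To see how the two selectors behave, and how a link and its name could be pulled from elements that do contain them, here is a minimal sketch that runs on a hand-written snippet instead of a live request. The markup, the hrefs and the BASE_URL constant are made up for illustration and are not what YouTube actually serves.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not YouTube's real HTML.
SAMPLE_HTML = """
<div class="content">
  <a class="video-title" href="/watch?v=abc123">First video</a>
  <a class="video-title" href="/watch?v=def456">Second video</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "#video-title" matches id="video-title"; nothing in the sample has that id.
# select() returns an empty list when nothing matches.
by_id = soup.select("#video-title")

# find() returns None when nothing matches.
missing = soup.find("div", attrs={"class": "no-such-class"})

# ".video-title" matches elements whose class list contains "video-title".
by_class = soup.select(".video-title")

# Pair each link's visible text with its absolute URL.
BASE_URL = "https://www.youtube.com"  # assumed base for resolving relative hrefs
videos = [
    (a.get_text(strip=True), urljoin(BASE_URL, a["href"]))
    for a in by_class
]

print(by_id)
print(missing)
for title, url in videos:
    print(title, url)
```

An empty list from select and None from find both mean that the selector matched nothing in the HTML that was actually downloaded, which can differ from what the browser shows after its scripts have run.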

Upvotes: 1
