Reputation: 117
I need to extract the video links and their titles from YouTube playlists.
I used SelectorGadget (a Chrome extension) to find the CSS selector, but when I try to select anything related to the links, BeautifulSoup returns None, and I don't know where I'm going wrong.
Below is the code I wrote:
import sys
import requests
from bs4 import BeautifulSoup
import re

try:
    # checking url format (raw string, so the backslashes reach the regex engine unchanged)
    url_pattern = re.compile(r"^(?:http|https|ftp):\/\/[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+\.[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+$")
    # playlist_url = input("Enter your youtube playlist url: ")
    # getting input directly from the command line
    playlist_url = sys.argv[1]
    if not url_pattern.match(playlist_url):
        raise ValueError("Enter valid link")
    get_links_from_youtube_playlist(playlist_url)
except ValueError as value_error:
    print(value_error)
Then I pass the URL to another function:
def get_links_from_youtube_playlist(youtube_playlist_url):
    request_response = requests.get(youtube_playlist_url)
    # using "html.parser" lib
    # soup_object = BeautifulSoup(request_response.text, 'html.parser')
    # using "lxml" - Processing XML and HTML with Python
    soup_object = BeautifulSoup(request_response.text, 'lxml')
    # not working?!
    url_list = soup_object.select("#video-title")
    print(url_list)
    # this is not working either?!
    div_content = soup_object.find("div", attrs={"class": "content"})
    print(div_content)
I run it with the following command:
python3 test.py https://www.youtube.com/playlist\?list\=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
My output is None when I print the result of either the select or find method. Shouldn't it find something meaningful, since the id is present in the page?
SelectorGadget shows me #video-title only when I click on that section, and I could not even access the div.
How should I extract each link and its name?
Upvotes: 0
Views: 1248
Reputation: 781848
YouTube checks the user agent to determine what kind of page to return. If you send the user agent corresponding to a real browser, you'll get the response you expect. video-title is a class, not an ID, so change the selector to .video-title.
import pprint
from bs4 import BeautifulSoup
import requests

pp = pprint.PrettyPrinter()

def get_links_from_youtube_playlist(youtube_playlist_url):
    # send a browser-like user agent so YouTube returns the browser version of the page
    request_response = requests.get(youtube_playlist_url, headers={"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"})
    soup_object = BeautifulSoup(request_response.text, 'lxml')
    # class selector (.video-title) instead of id selector (#video-title)
    url_list = soup_object.select(".video-title")
    pp.pprint(url_list)

get_links_from_youtube_playlist('https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab')
Output:
[<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>]
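The difference between the id selector and the class selector can be reproduced offline. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup invented for this example (it is not what YouTube returns), and extract_links is a made-up helper name. It assumes the titles and links sit in anchor tags carrying the video-title class, which may not hold for the real page, where the matched divs above are still empty placeholders. Note that select returns an empty list when nothing matches, while find returns None.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; NOT the real YouTube response.
SAMPLE_HTML = """
<div class="content">
  <a class="video-title" href="/watch?v=abc123">First video</a>
  <a class="video-title" href="/watch?v=def456">Second video</a>
  <div class="video-title text-shell skeleton-bg-color"></div>
</div>
"""


def extract_links(html, base_url="https://www.youtube.com"):
    """Return (title, absolute_url) pairs for anchors with class video-title."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # "a.video-title[href]" keeps only <a> tags that have the class and an href
    for tag in soup.select("a.video-title[href]"):
        title = tag.get_text(strip=True)
        pairs.append((title, urljoin(base_url, tag["href"])))
    return pairs


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# id selector: no element has id="video-title", so the result is an empty list
print(soup.select("#video-title"))

# class selector: matches both anchors and the empty placeholder div
print(len(soup.select(".video-title")))

for title, url in extract_links(SAMPLE_HTML):
    print(title, url)
```

If the real response only contains empty placeholder divs, as in the output above, this approach will find no anchors, and the titles would have to come from wherever the page loads them afterwards.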
Upvotes: 1