A_B

Reputation: 29

How to scrape list of titles from a webpage?

I am trying to scrape the list of courses available on the Udacity website:

https://www.udacity.com/courses/all

The webpage has a list of courses. I am trying to get the names of the courses, i.e. all the aria-labels.

I have tried to get them as follows, but I am not getting any output:

soup = BeautifulSoup(r.text, "html.parser")
name = soup.find_all("class", class_= "card_container__25DrK")

[Screenshot; image not available]

Upvotes: 1

Views: 3623

Answers (3)

NetrobeWebby

Reputation: 116

Method 1:

Recommended method: it is the fastest, it gets all the titles, and you can tweak the code to get all the skills on Udacity too.
If you watch the network requests for the Udacity course list in Chrome developer tools, you will find that the list is loaded from https://www.udacity.com/data/catalog.json. From there we can fetch the raw JSON and parse the results very quickly with Python's json module.

import requests
import json


# Get Main course content
url = "https://www.udacity.com/data/catalog.json"
response = requests.get(url)

# Load json data from the response
data_store = json.loads(response.text)

titles = []

# Get the titles
for option in data_store:
    if option['type'] == 'course':
        titles.append(option['payload']['title'])

print(titles)
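
The filtering step above can also be pulled into a small function, which makes it possible to check the logic without a network call. This is only a minimal sketch: it assumes the catalog is a list of entries shaped like the ones the loop above reads (a `type` key and a `payload` dictionary with a `title`). The function name `course_titles`, the sample entries, and the `"skill"` type value are made up for illustration and are not confirmed against the real feed.

```python
def course_titles(catalog, wanted_type="course"):
    """Return the titles of all catalog entries of the given type.

    Entries without a payload or a title are skipped instead of raising KeyError.
    """
    titles = []
    for entry in catalog:
        if entry.get("type") != wanted_type:
            continue
        # `or {}` also covers a payload that is present but set to None
        title = (entry.get("payload") or {}).get("title")
        if title:
            titles.append(title)
    return titles


# Hypothetical sample data in the same shape the loop above expects
sample_catalog = [
    {"type": "course", "payload": {"title": "Intro to Statistics"}},
    {"type": "skill", "payload": {"title": "Python"}},
    {"type": "course", "payload": {}},
    {"type": "course", "payload": {"title": "Version Control with Git"}},
]

print(course_titles(sample_catalog))
```

With the real response, this would presumably be called as `course_titles(response.json())`.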

Method 2:

This does not get all the titles: at the time of testing there were 272 titles, and this gets just 172.
You can scrape them with just BeautifulSoup and json. Check the page source and you will find the tag <script id="__NEXT_DATA__" type="application/json">, which contains the data for the site as JSON. You can take it, parse it into a Python dictionary, and drill down to the titles. :-)

from bs4 import BeautifulSoup
import requests
import json

# Get website content
url = "https://www.udacity.com/courses/all"
response = requests.get(url)

# Parse content with html parser and beautiful soup
soup = BeautifulSoup(response.text, "html.parser")
script = soup.find("script", id="__NEXT_DATA__")

# Load json data from script tag text scraped above
data_store = json.loads(script.text)

titles = []

# Get the data from the shared_store key in the data_store dictionary
shared_store = data_store["props"]["pageProps"]["header"]["store"]["__SHARED_STORE__"]

# There are two important keys in the shared_store (popularPrograms, schoolToPrograms)
# The titles shown first on the site are the contents of the `popularPrograms` key
schoolToPrograms = shared_store['schoolToPrograms']
popular = shared_store['popularPrograms']

# Add these first because they are the titles shown first on the website
for obj in popular:
    titles.append(obj['name'])

# These are the rest of the contents
for obj in schoolToPrograms:
    for item in obj['items']:
        titles.append(item["name"])

print(titles)
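
The long key chain used to reach `__SHARED_STORE__` raises a KeyError as soon as the site renames any level. One possible way to soften that is a small lookup helper. The sketch below is not part of the original answer: `dig` is a hypothetical helper name, and `sample_store` is a heavily trimmed stand-in for the parsed `__NEXT_DATA__` JSON, built only from the keys the code above uses.

```python
def dig(data, *keys, default=None):
    """Follow a chain of dictionary keys, returning `default` if any is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical, heavily trimmed stand-in for the parsed __NEXT_DATA__ JSON
sample_store = {
    "props": {
        "pageProps": {
            "header": {
                "store": {
                    "__SHARED_STORE__": {
                        "popularPrograms": [{"name": "Data Engineer"}],
                        "schoolToPrograms": [
                            {"items": [{"name": "SQL"}, {"name": "React"}]},
                        ],
                    }
                }
            }
        }
    }
}

path = ("props", "pageProps", "header", "store", "__SHARED_STORE__")
shared_store = dig(sample_store, *path, default={})

# Same order as above: popular programs first, then the per-school lists
titles = [obj["name"] for obj in shared_store.get("popularPrograms", [])]
for school in shared_store.get("schoolToPrograms", []):
    titles.extend(item["name"] for item in school.get("items", []))

print(titles)
# A missing key yields the default instead of a KeyError
print(dig({"props": {}}, *path, default={}))
```

If the lookup comes back empty, the script can report that the page layout changed rather than failing with a traceback.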

Upvotes: 1

Prins

Reputation: 1051

The main issue here is that BeautifulSoup by itself only does static scraping, i.e. it only sees the static HTML it is given. You will need to use something like Selenium together with BeautifulSoup to scrape dynamically generated HTML.

You may find the following tutorial useful: WebScraping with BeautifulSoup and Selenium

Additionally, make sure the correct tag is being targeted. For example, in your screenshot the target is an anchor tag, so your find_all should be as follows:

name = soup.find_all('a', class_='card_container__25DrK')

However, do check the HTML retrieved by your program to make sure you are targeting the correct tag and specifying the correct attribute value.
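
One way to run that check is to count how many matching tags the downloaded HTML actually contains before writing any extraction code. The sketch below is an illustration only. It uses the standard library's `html.parser` instead of BeautifulSoup so that it has no third-party dependencies, and both HTML strings are invented examples, not the real Udacity markup. The names `ClassCounter` and `count_matches` are hypothetical.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one type whose class attribute contains a fragment."""

    def __init__(self, tag, class_fragment):
        super().__init__()
        self.tag = tag
        self.class_fragment = class_fragment
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # attrs is a list of (name, value) pairs; value can be None
        class_value = dict(attrs).get("class") or ""
        if self.class_fragment in class_value:
            self.count += 1


def count_matches(html, tag, class_fragment):
    """Return how many `tag` elements in `html` have a matching class."""
    parser = ClassCounter(tag, class_fragment)
    parser.feed(html)
    return parser.count


# Invented HTML: what a browser might show after scripts have run ...
rendered_html = '<ul><li><a class="card_container__25DrK" aria-label="SQL"></a></li></ul>'
# ... versus the kind of empty shell a plain HTTP request might get back
static_html = '<div id="__next"></div><script src="app.js"></script>'

print(count_matches(rendered_html, "a", "card_container"))
print(count_matches(static_html, "a", "card_container"))
```

A count of zero on the HTML your program retrieved suggests the elements are added later by JavaScript, or that the tag or class name is wrong.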

Upvotes: -1

Ethicist

Reputation: 827

The issue with creating a soup from just the initial HTML is that the site does not load everything at once; it adds further courses dynamically, possibly to keep the initial page load time low. To solve this you can use something like Selenium for Python.

Then, we'll use CSS Selectors to select h2 elements with a class attribute containing "card_title" (I viewed the source on that site and it looks like that's how courses are displayed).

You'll need to download a driver for Selenium. I'm using Chrome on Windows here, so I downloaded chromedriver.exe from the list of available drivers (ChromeDriver 104.0.5112.79) for the latest stable release.

Example code:

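from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')

# I'm using Chrome in this example; you can search online for more on
# how Selenium works. This executable path points to where I downloaded the driver.
# Selenium 4 takes the driver path through a Service object
# (passing executable_path straight to webdriver.Chrome was deprecated in Selenium 4)
service = Service(executable_path=r'C:\Users\User\Downloads\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=service, options=options)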
browser.get("https://www.udacity.com/courses/all")

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

# match h2 elements with a class containing "card_title"
for course in soup.select('h2[class*="card_title"]'):
    course_name = course.get_text()
    # do something with course_name, e.g. add it to a list
    print(course_name)

browser.quit()

Output:

Data Engineer
Business Analytics
Product Manager
Programming for Data Science with Python
Introduction to Programming
Data Scientist
Data Analyst
C++
React
Blockchain Developer
Self-Driving Car Engineer
Machine Learning DevOps Engineer
Deep Learning
SQL
Front End Web Developer
Full Stack Web Developer
Java Programming
Digital Marketing
Artificial Intelligence for Trading
Data Structures and Algorithms
UX Designer
Java Developer
AWS Machine Learning Engineer
Intermediate Python
AI Programming with Python
Growth Product Manager
Intro to Self-Driving Cars
Cloud DevOps Engineer
Robotics Software Engineer
Deep Reinforcement Learning
Data Architect
Android Kotlin Developer
Computer Vision
Data Analysis and Visualization with Microsoft Power BI
Natural Language Processing
Cloud Developer
Zero Trust Security
Data Streaming
AI Product Manager
Introduction to Cybersecurity
iOS Developer
Data Engineering with Microsoft Azure
Intro to Machine Learning with TensorFlow
AWS Cloud Architect
Full Stack JavaScript Developer
Digital Project Management
Cloud Native Application Architecture
Intro to Machine Learning with PyTorch
Data Product Manager
Flying Car and Autonomous Flight Engineer
Sensor Fusion Engineer
Ethical Hacker
Predictive Analytics For Business
Intermediate JavaScript
Android Basics
Artificial Intelligence
Agile Software Development
Marketing Analytics
Data Visualization
Cloud DevOps using Microsoft Azure
Digital Freelancer
AI for Healthcare
Hybrid Cloud Engineer
Data Science for Business Leaders
AI for Business Leaders
Privacy Engineer
Site Reliability Engineer
Security Engineer
Cloud Developer using Microsoft Azure
Cloud Architect using Microsoft Azure
Machine Learning Engineer for Microsoft Azure
Security Architect
AI Engineer using Microsoft Azure
Data Privacy
Security Analyst
Enterprise Security
Intel® Edge AI for IoT Developers
Cloud Computing for Business Leaders
Programming for Data Science with R
RPA Developer with UiPath
Cybersecurity for Business Leaders
Intro to Information Security
Cyber-Physical Systems Security
Network Security
Getting Started with Google Workspace
Rapid Prototyping
Creating an Analytical Dataset
Problem Solving with Advanced Analytics
Classification Models
Product Design
Segmentation and Clustering
Time Series Forecasting
App Marketing
App Monetization
A/B Testing for Business Analysts
How to Build a Startup
Get Your Startup Started
Managing Remote Teams with Upwork
Google Cloud Digital Leader Training
Cloud Native Fundamentals
Hybrid Cloud Fundamentals
Intro to Data Analysis
SQL for Data Analysis
Database Systems Concepts & Design
Intro to Inferential Statistics
Spark
Data Analysis and Visualization
Cyber-Physical Systems Design & Analysis
Differential Equations in Action
Self-Driving Fundamentals: Featuring Apollo
AWS Machine Learning Foundations Course
Introduction to Machine Learning using Microsoft Azure
AI Fundamentals
Linear Algebra Refresher Course
Machine Learning: Unsupervised Learning
Big Data Analytics in Healthcare
Intel® Edge AI Fundamentals with OpenVINO™
Artificial Intelligence
Secure and Private AI
Model Building and Validation
Data Visualization and D3.js
Machine Learning for Trading
Machine Learning
Intro to Hadoop and MapReduce
Real-Time Analytics with Apache Storm
A/B Testing
Data Analysis with R
Knowledge-Based AI: Cognitive Systems
Introduction to TensorFlow Lite
Introduction to Computer Vision
Intro to TensorFlow for Deep Learning
Eigenvectors and Eigenvalues
Intro to Artificial Intelligence
Artificial Intelligence for Robotics
Intro to Deep Learning with PyTorch
AWS DeepRacer
Reinforcement Learning
Introduction to Machine Learning Course
Product Manager Interview Preparation
Microsoft Power Platform
Web Tooling & Automation
Front End Frameworks
Responsive Web Design Fundamentals
How to Install Android Studio
Android Basics: Multiscreen Apps
Website Performance Optimization
iOS Networking with Swift
JavaScript Design Patterns
Android Basics: User Input
Android Performance
Responsive Images
Xcode Debugging
Gradle for Android and Java
Build Native Mobile Apps with Flutter
JavaScript Promises
UIKit Fundamentals
Android Basics: User Interface
Client-Server Communication
What is Programming?
Building High Conversion Web Forms
Advanced Android App Development
Software Architecture & Design
Authentication & Authorization: OAuth
Intro to iOS App Development with Swift
Introduction to Operating Systems
Android Basics: Networking
Web Accessibility
Android Basics: Data Storage
Scalable Microservices with Kubernetes
Developing Android Apps with Kotlin
Browser Rendering Optimization
Learn Swift Programming Syntax
Offline Web Applications
Kotlin for Android Developers
UX Design for Mobile Developers
Software Development Process
Data Visualization in Tableau
Intro to Progressive Web Apps
Writing READMEs
Software Analysis & Testing
iOS Persistence and Core Data
Computer Networking
Firebase Analytics: iOS
Human-Computer Interaction
2D Game Development with libGDX
Intro to jQuery
How to create <anything> in Android
Introduction to Graduate Algorithms
Dynamic Web Applications with Sinatra
How to Make a Platformer Using libGDX
JavaScript Testing
Object-Oriented JavaScript
Localization Essentials
Compilers: Theory and Practice
HTML5 Canvas
Object Oriented Programming in Java
Designing RESTful APIs
GT - Refresher - Advanced OS
Intro to JavaScript
Grand Central Dispatch (GCD)
Continuous Integration and Deployment
Swift for Beginners
Intro to Statistics
Intro to HTML and CSS
Developing Android Apps
Introduction to Python Programming
Introduction to Virtual Reality
Objective-C for Swift Developers
Interactive 3D Graphics
Full Stack Foundations
High Performance Computer Architecture
AutoLayout
Kotlin Bootcamp for Programmers
Shell Workshop
Core ML: Machine Learning for iOS
Statistics
Intro to Theoretical Computer Science
Design of Computer Programs
Data Wrangling with MongoDB
Swift for Developers
Firebase in a Weekend: Android
Software Debugging
Deploying a Hadoop Cluster
Server-Side Swift
Networking for Web Developers
Intro to Physics
Intro to Relational Databases
ES6 - JavaScript Improved
Mobile Design and Usability for iOS
Intro to AJAX
Intro to Algorithms
The MVC Pattern in Ruby
WeChat Mini Program Development
Asynchronous JavaScript Requests
Embedded Systems
High Performance Computing
HTTP & Web Servers
Advanced Android with Kotlin
Computability, Complexity & Algorithms
Advanced Operating Systems
Passwordless Login Solutions for iOS
Version Control with Git
Firebase in a Weekend: iOS
Intro to Point & Click App Development
Deploying Applications with Heroku
Applied Cryptography
Java Programming Basics
C++ For Programmers
Intro to Backend
JavaScript and the DOM
Firebase Analytics: Android
Configuring Linux Web Servers
How to Make an iOS App
Intro to DevOps
Google Maps APIs
Passwordless Login Solutions for Android
Mobile Design and Usability for Android
iOS Design Patterns
Intro to Psychology
Engagement & Monetization | Mobile Games
Material Design for Android Developers
Craft Your Cover Letter
Refresh Your Resume
Strengthen Your LinkedIn Network & Brand
Data Science Interview Prep
Android Interview Prep
Machine Learning Interview Preparation
Front-End Interview Prep
Full-Stack Interview Prep
Data Structures & Algorithms in Swift
iOS Interview Prep
VR Interview Prep

Upvotes: 3
