Bharat Chandwani

Reputation: 27

How do I scrape "description" of movies in the IMDB website using BeautifulSoup?

I am using BeautifulSoup to scrape movies from the IMDb website. I was able to scrape the name, genre, duration and rating of each movie. However, I cannot scrape the description: its element has the class "text-muted", and that class appears several times, on the elements holding other data such as rating, genre and duration. Those elements also have inner classes, so they were easy to select, but the description has no inner class. Selecting by "text-muted" alone therefore returns the other data as well. How do I get just the description of each movie?

The code and a screenshot are attached for reference. The area marked in red shows the class name of the description and the strip below the movie name.

The sample code which I used to scrape genre is as follows:

genre_tags = data.select(".text-muted .genre")
genre = [g.get_text() for g in genre_tags]
Genre = [item.strip() for item in genre if item.strip()]  # strip whitespace, drop empty entries
print(Genre)

Upvotes: 0

Views: 1312

Answers (2)

Junior Saldanha

Reputation: 1

You can use this. :) If it helps you, please upvote my solution. Thanks.

from bs4 import BeautifulSoup
from requests_html import HTMLSession

URL = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm'  # URL of Most Popular Movies on IMDb

session = HTMLSession()  # reuse one session for every request
chart = BeautifulSoup(session.get(URL).html.html, 'html.parser')

movies = chart.find_all('tbody', 'lister-list')[0]  # table body of Most Popular Movies
for cell in movies.find_all('td', 'titleColumn'):
    link = cell.find('a')  # link to the movie's own page
    href = 'https://www.imdb.com' + link.get('href')
    moviepage = BeautifulSoup(session.get(href).html.html, 'html.parser')  # request and parse each movie page
    heading = list(moviepage.find_all('h1')[0].stripped_strings)
    title = heading[0]  # parse title
    year = heading[2]  # parse year
    try:
        score = list(moviepage.find_all('div', 'ratingValue')[0].stripped_strings)[0]  # parse score if available
    except IndexError:
        score = '-'  # '-' is used when no score is available
    description = list(moviepage.find_all('div', 'summary_text')[0].stripped_strings)[0]  # parse description
    print(f'TITLE: {title}      YEAR: {year}       SCORE: {score}\nDESCRIPTION:{description}\n')
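The description lookup above raises IndexError when a page has no "summary_text" block, just as the score lookup would without its try/except. A minimal sketch of one way to handle both cases with a single helper follows. The name first_text and the inline sample markup are hypothetical and used only for illustration; the real IMDb markup may differ.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag, css_class, default="-"):
    """Return the first non-empty string inside the first matching tag, or default."""
    node = soup.find(tag, css_class)
    if node is None:
        return default
    # stripped_strings is a generator, so next() can supply a fallback value
    return next(node.stripped_strings, default)


# Hypothetical sample markup standing in for a downloaded movie page
SAMPLE = '<div class="summary_text"> A plot summary. <a href="#">See more</a></div>'
page = BeautifulSoup(SAMPLE, "html.parser")

description = first_text(page, "div", "summary_text")
score = first_text(page, "div", "ratingValue")  # not present in the sample
print(description, score)
```

With a helper like this, the try/except around the score would no longer be needed, and a missing description would not stop the loop.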
    


Junior Saldanha @UmSaldanha

Upvotes: 0

Ahmed Mamdouh

Reputation: 706

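In general, lxml is faster than BeautifulSoup, and it supports XPath, which lets you select the description by its position relative to the ratings bar.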

import requests
from lxml import html

url = "xxxx"

r = requests.get(url)

tree = html.fromstring(r.text)

rows = tree.xpath('//div[@class="lister-item mode-detail"]')

for row in rows:
    description = row.xpath('.//div[@class="ratings-bar"]/following-sibling::p[@class="text-muted"]/text()')[0].strip()
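
Since the question asks about BeautifulSoup, the same idea (take the "text-muted" paragraph that follows the ratings bar) can probably be expressed there as well. The following is a minimal sketch, not a tested solution against the live site: the sample HTML is a hypothetical stand-in that reuses the class names mentioned in this thread, and extract_descriptions is a name made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one list entry; the real IMDb page may differ.
SAMPLE_HTML = """
<div class="lister-item mode-detail">
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a href="/title/tt0000001/">Example Movie</a></h3>
    <p class="text-muted">
      <span class="runtime">120 min</span>
      <span class="genre">Drama</span>
    </p>
    <div class="ratings-bar"><strong>8.1</strong></div>
    <p class="text-muted">
      A short plot summary goes here.
    </p>
    <p class="text-muted">Director: <a href="#">Someone</a></p>
  </div>
</div>
"""


def extract_descriptions(html_text):
    """Return one description (or None) per list entry."""
    soup = BeautifulSoup(html_text, "html.parser")
    descriptions = []
    for item in soup.select("div.lister-item"):
        bar = item.select_one("div.ratings-bar")
        # find_next_sibling skips text nodes and returns the first matching tag
        paragraph = bar.find_next_sibling("p", class_="text-muted") if bar else None
        descriptions.append(paragraph.get_text(strip=True) if paragraph else None)
    return descriptions


descriptions = extract_descriptions(SAMPLE_HTML)
print(descriptions)
```

Entries without a ratings bar or a following paragraph yield None instead of raising an error, so the result stays aligned with the other per-movie lists.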

Upvotes: 1
