Reputation: 27
I am using BeautifulSoup to scrape movies in the IMDB website. I was able to scrape name, genre, duration, rating of movies successfully. But I am not able to scrape description of the movies as when I am looking at the classes, it is "text-muted" and since this class is there multiple times holding other data such as rating, genre, duration. But since these data has inner classes also, so it was easier for me to scrape it but when it is coming to description, it does not have any inner class. So when pulling out data just using "text-muted" is giving other data also. How do I just get the description of the movies?
Attaching the code and screenshot for reference:
The sample code which I used to scrape genre is as follows:
genre_tags=data.select(".text-muted .genre")
genre=[g.get_text() for g in genre_tags]
Genre = [item.strip() for item in genre if str(genre)]
print(Genre)
Upvotes: 0
Views: 1312
Reputation: 1
You can use this, :) , if helped you, UP my solution pls.. thks,
from bs4 import BeautifulSoup
from requests_html import HTMLSession
URL = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm' #url of Most Popular Movies in IMDB
PAGE = HTMLSession().get(URL)
PAGE_BS4 = BeautifulSoup(PAGE.html.html,'html.parser')
MoviesObj = PAGE_BS4.find_all("tbody","lister-list") #get table body of Most Popular Movies
for index in range(len(MoviesObj[0].find_all("td","titleColumn"))):
a = list(MoviesObj[0].find_all("td","titleColumn")[index])[1]
href = 'https://www.imdb.com'+a.get('href') #get each link for movie page
moviepage = HTMLSession().get(href) #request each page of movie
moviepage = BeautifulSoup(moviepage.html.html,'html.parser')
title = list(moviepage.find_all('h1')[0].stripped_strings)[0] #parse title
year = list(moviepage.find_all('h1')[0].stripped_strings)[2] #parse year
try:
score = list(moviepage.find_all('div','ratingValue')[0].stripped_strings)[0] #parse score if is available
except IndexError:
score = '-' #if score is not available '-' is filled
description = list(moviepage.find_all('div','summary_text')[0].stripped_strings)[0] #parse description
print(f'TITLE: {title} YEAR: {year} SCORE: {score}\nDESCRIPTION:{description}\n')
Junior Saldanha @UmSaldanha
Upvotes: 0
Reputation: 706
In general, lxml is much better than beautifulsoup.
import requests
from lxml
import html
url = "xxxx"
r = requests.get(url)
tree = html.fromstring(r.text)
rows = tree.xpath('//div[@class="lister-item mode-detail"]')
for row in rows:
description = row.xpath('.//div[@class="ratings-bar"]/following-sibling::p[@class="text-muted"]/text()')[0].strip()
Upvotes: 1