Reputation: 21
I want to scrape the directors and actors from a single IMDB webpage that lists the top 50 films of 2018. The issue is that I have no idea how to select them, because the class attribute of the enclosing tag is empty.
Part of my code which is working fine:
    import requests
    from bs4 import BeautifulSoup

    response = requests.get('https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1')
    soup = BeautifulSoup(response.text, 'html.parser')
    for i in soup.find_all('div', class_='lister-item-content'):
        # film_details was not defined in this excerpt; searching from i itself finds the same spans
        film_length = i.find('span', class_='runtime').text
        film_genre = i.find('span', class_='genre').text
        public_rating = i.find('div', class_='ratings-bar').strong.text
Part of the HTML code that I don't know how to work with:
<p class="">
Directors:
<a href="/name/nm0751577/">Anthony Russo</a>,
<a href="/name/nm0751648/">Joe Russo</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000375/">Robert Downey Jr.</a>,
<a href="/name/nm1165110/">Chris Hemsworth</a>,
<a href="/name/nm0749263/">Mark Ruffalo</a>,
<a href="/name/nm0262635/">Chris Evans</a>
</p>
I want to be able to pull all Directors and all listed Actors for each film. I want to do that from the single URL as provided in the code.
Upvotes: 0
Views: 273
Reputation: 21
QHarr's answer was great, but I later noticed that some films have no director(s) listed at all, and in those cases the code skipped the film entirely. I therefore updated QHarr's code so that it takes this scenario into account:
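    reqs = 0  # counter; initialise once before the loop
    # Select the blocks that list a director once, not on every iteration
    with_directors = soup.select('p:contains("Director:"), p:contains("Directors:")')
    for item in soup.select('p:contains("Stars:")'):
        reqs += 1
        if item not in with_directors:
            # No director listed: every link in the block is an actor
            actors = [d.text for d in item.select('a:not(span ~ a)')]
            directors = 'none'
        else:
            # Links before the separator span are directors, links after it are actors
            directors = ', '.join(d.text for d in item.select('a:not(span ~ a)'))
            actors = [d.text for d in item.select('span ~ a')]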
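The no-director case can also be decided per block, by looking for the separator span, instead of comparing against a second selection. The following is a minimal sketch rather than the code from either answer: the helper name split_credits and the two inline fragments are made up for illustration, and it assumes the ghost span appears only when both directors and stars are listed, as in the markup shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments modelled on the markup in the question:
# one film with directors, one with only a "Stars:" list.
HTML = """
<p class="">Directors:
<a href="/name/nm0751577/">Anthony Russo</a>,
<a href="/name/nm0751648/">Joe Russo</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000375/">Robert Downey Jr.</a>,
<a href="/name/nm1165110/">Chris Hemsworth</a>
</p>
<p class="">Stars:
<a href="/name/nm0749263/">Mark Ruffalo</a>,
<a href="/name/nm0262635/">Chris Evans</a>
</p>
"""


def split_credits(block):
    """Return (directors, actors) for one credits <p> block."""
    separator = block.find('span', class_='ghost')
    if separator is None:
        # No separator: the block holds a single list of names.
        names = [a.get_text(strip=True) for a in block.find_all('a')]
        if 'Director' in block.get_text():
            return names, []
        return [], names
    # find_previous_siblings walks backwards, so restore page order.
    directors = [a.get_text(strip=True)
                 for a in separator.find_previous_siblings('a')]
    directors.reverse()
    actors = [a.get_text(strip=True)
              for a in separator.find_next_siblings('a')]
    return directors, actors


soup = BeautifulSoup(HTML, 'html.parser')
results = [split_credits(p) for p in soup.find_all('p')]
for directors, actors in results:
    print(directors or ['none'], actors)
```

Returning an empty list for missing directors keeps both values the same type; the 'none' placeholder is only applied when printing.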
Upvotes: 1
Reputation: 84455
You can use :contains, specifying Director: or Directors:, to target the block for each film. Then separate out the director(s) by selecting the a tags that come before the span tag (that is, by filtering out those after it). The actors are the a tags that are general siblings following the span tag. Requires bs4 4.7.1+.
    import requests
    from bs4 import BeautifulSoup as bs

    r = requests.get('https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1')
    soup = bs(r.content, 'lxml')
    for item in soup.select('p:contains("Director:"), p:contains("Directors:")'):
        # print(item)
        directors = [d.text for d in item.select('a:not(span ~ a)')]
        actors = [d.text for d in item.select('span ~ a')]
        print(directors, actors)
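To see what the two selectors pick out without requesting the live page, they can be run against the fragment from the question. This is only an illustrative check: the FRAGMENT string is copied from the question, and the built-in html.parser is assumed here in place of lxml so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Credits block copied from the question.
FRAGMENT = """
<p class="">
Directors:
<a href="/name/nm0751577/">Anthony Russo</a>,
<a href="/name/nm0751648/">Joe Russo</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000375/">Robert Downey Jr.</a>,
<a href="/name/nm1165110/">Chris Hemsworth</a>,
<a href="/name/nm0749263/">Mark Ruffalo</a>,
<a href="/name/nm0262635/">Chris Evans</a>
</p>
"""

soup = BeautifulSoup(FRAGMENT, 'html.parser')
block = soup.select_one('p')

# "span ~ a" matches every <a> that has a <span> sibling before it,
# so negating it keeps only the links ahead of the separator.
directors = [a.text for a in block.select('a:not(span ~ a)')]
actors = [a.text for a in block.select('span ~ a')]

print(directors)
print(actors)
```

Note that newer Soup Sieve releases deprecate :contains in favour of :-soup-contains, so the p:contains(...) selectors above may emit a warning there.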
Upvotes: 1