Saswati

Reputation: 43

Web scraping for IMDB unable to retrieve desired columns

I am trying to scrape the IMDB website for the Top 50 horror movies. I want to extract the movie name, rating, director name, genre, and runtime.

I inspected the elements for the movie name, for the rating and director name, and for the runtime and genre.

[Screenshots: movie name; rating and director name; runtime and genre]

After inspecting those elements, I wrote the following code to extract the title, director's name, rating, runtime, and genre.

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'
#r = requests.get(my_url, headers=headers)#, proxies=proxies)
request=urllib.request.Request(my_url,None,headers)
response = urllib.request.urlopen(request)
page_html = response.read()
page_soup = BeautifulSoup(page_html,"html.parser")
page_soup.h1
page_soup.body.span
containers = page_soup.findAll("div",{"class":"lister-item mode-advanced"})
print(len(containers))

for container in containers:
  title=container.findAll("a",{"class": "lister-item-index unbold-text-primary"})
  rating = container.findAll("div",{"class":"inline-block.ratings-imdb-rating"})
  duration = container.findAll("span",{"class":"runtime"})
  genre = container.findAll("span",{"class":"genre"})
  director = container.findAll("p",{"class":"text-muted"})

print(title)
print(rating)
print(duration)
print(genre)
print(director) 

However, my code is unable to retrieve those attributes.

Output:

50
[]
[]
[<span class="runtime">90 min</span>]
[<span class="genre">
Horror, Mystery, Thriller            </span>]
[<p class="text-muted ">
<span class="runtime">90 min</span>
<span class="ghost">|</span>
<span class="genre">
Horror, Mystery, Thriller            </span>
</p>, <p class="text-muted">
    A decades-old folk tale surrounding a deranged murderer killing those who celebrate Valentine's Day turns out to be true to legend when a group defies the killer's order and people start turning up dead.</p>]

It would be helpful if anyone could help me find out what I am missing.

Upvotes: 0

Views: 735

Answers (2)

chitown88

Reputation: 28640

HTML has a tree-like structure. You want to find the parent nodes, then iterate through them to get what is inside each one. This site is pretty good to practice with. The director is the only tricky bit, as it is in a <p> tag with no attribute to distinguish it, so you need a little logic to get it. (Note that you could use a regex to find it, but I wanted to show you a loop since you are learning.) I also attached images so you can see where I'm getting those tags and attributes:

import requests
from bs4 import BeautifulSoup


headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'

response = requests.get(my_url, headers=headers)
page_html = response.text
page_soup = BeautifulSoup(page_html,"html.parser")


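# Each movie lives in its own <div class="lister-item-content"> block.
movies = page_soup.find_all('div', {'class': 'lister-item-content'})
for movie in movies:
    title = movie.find('h3').find('a').text
    # Note: this span holds the certificate (e.g. 18, R), not the IMDb score.
    try:
        rating = movie.find('p').find('span', {'class': 'certificate'}).text
    except AttributeError:
        rating = ''
    genre = movie.find('p').find('span', {'class': 'genre'}).text.strip()
    try:
        runtime = movie.find('p').find('span', {'class': 'runtime'}).text
    except AttributeError:
        runtime = ''
    # The director <p> has no distinguishing attribute, so match on its text.
    # Reset first so a missing director does not reuse the previous movie's value.
    director = ''
    for p in movie.find_all('p'):
        if 'Director' in p.text:
            director = p.find('a').text

    print(title, rating, genre, runtime, director)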

Output:

Wrong Turn 18 Horror, Thriller 109 min Mike P. Nelson
Willy's Wonderland 15 Action, Comedy, Horror 88 min Kevin Lewis
Red Dot 15 Drama, Horror, Thriller 86 min Alain Darborg
Saint Maud 15 Drama, Horror, Mystery 84 min Rose Glass
Freaky 15 Comedy, Horror, Thriller 102 min Christopher Landon
Doctor Strange in the Multiverse of Madness  Action, Adventure, Fantasy  Sam Raimi
Midsommar 18 Drama, Horror, Mystery 148 min Ari Aster
Fear of Rain PG-13 Drama, Horror, Thriller 109 min Castille Landon
The Little Stranger 12A Drama, Horror, Mystery 111 min Lenny Abrahamson
Army of the Dead R Action, Crime, Horror  Zack Snyder
Get Out 15 Horror, Mystery, Thriller 104 min Jordan Peele
Synchronic 15 Drama, Horror, Sci-Fi 102 min Justin Benson
The Rental 15 Drama, Horror, Mystery 88 min Dave Franco
Shadow in the Cloud R Action, Horror, War 83 min Roseanne Liang
Don't Worry Darling  Horror, Thriller  Olivia Wilde
Venom: Let There Be Carnage  Action, Horror, Sci-Fi  Andy Serkis
The Shining 15 Drama, Horror 146 min Stanley Kubrick
The Witch 15 Drama, Horror, Mystery 92 min Robert Eggers
Split 15 Horror, Thriller 117 min M. Night Shyamalan
Hereditary 15 Drama, Horror, Mystery 127 min Ari Aster
Wrong Turn 18 Horror, Thriller 84 min Rob Schmidt
Antebellum 15 Drama, Horror, Mystery 105 min Gerard Bush
Possessor 18 Horror, Sci-Fi, Thriller 103 min Brandon Cronenberg
The New Mutants 15 Action, Horror, Sci-Fi 94 min Josh Boone
Doctor Sleep 15 Drama, Fantasy, Horror 152 min Mike Flanagan
The Invisible Man R Drama, Horror, Mystery 124 min Leigh Whannell
The Meg 12A Action, Horror, Sci-Fi 113 min Jon Turteltaub
Alien X Horror, Sci-Fi 117 min Ridley Scott
The Lighthouse 15 Drama, Fantasy, Horror 109 min Robert Eggers
Scream  Horror, Mystery, Thriller  Matt Bettinelli-Olpin
Run PG-13 Horror, Mystery, Thriller 90 min Aneesh Chaganty
Porno 18 Comedy, Horror 98 min Keola Racela
The Hunt 15 Action, Horror, Thriller 90 min Craig Zobel
Becky 18 Action, Crime, Drama 93 min Jonathan Milott
It 15 Horror 135 min Andy Muschietti
Dark Water 15 Drama, Horror, Mystery 105 min Walter Salles
A Quiet Place Part II 15 Drama, Horror, Sci-Fi 97 min John Krasinski
A Quiet Place 15 Drama, Horror, Sci-Fi 90 min John Krasinski
The Witches PG Adventure, Comedy, Family 106 min Robert Zemeckis
Resident Evil  Action, Horror, Mystery  Johannes Roberts
Us 15 Horror, Mystery, Thriller 116 min Jordan Peele
Psycho Goreman  Comedy, Horror, Sci-Fi 95 min Steven Kostanski
The Empty Man 18 Crime, Drama, Horror 137 min David Prior
From Dusk Till Dawn 18 Action, Crime, Horror 108 min Robert Rodriguez
The Platform 18 Horror, Sci-Fi, Thriller 94 min Galder Gaztelu-Urrutia
The Conjuring 3  Horror, Mystery, Thriller  Michael Chaves
Underwater 15 Action, Horror, Sci-Fi 95 min William Eubank
My Bloody Valentine 18 Horror, Mystery, Thriller 101 min Patrick Lussier
Sputnik 15 Drama, Horror, Sci-Fi 113 min Egor Abramenko
My Bloody Valentine X Horror, Mystery, Thriller 90 min George Mihalka

[Four screenshots showing the inspected tags and attributes]
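
The regex alternative mentioned in this answer could look roughly like the sketch below. It is not taken from the answer itself: the HTML fragment is a hypothetical, simplified stand-in for one list item (the real page has more markup), and `find_director` is an illustrative helper name. It assumes the director paragraph contains the text "Director" followed by a link to the director's name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one list item; names and hrefs are made up.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><a href="/title/tt0000001/">Example Movie</a></h3>
  <p class="text-muted"><span class="runtime">90 min</span></p>
  <p class="text-muted">A short plot summary.</p>
  <p>Director: <a href="/name/nm0000001/">Jane Doe</a> | Stars: <a href="/name/nm0000002/">John Roe</a></p>
</div>
"""

def find_director(movie):
    """Return the first linked name in the <p> that mentions 'Director', or ''."""
    # string=re.compile(...) matches a text node; find_parent('p') climbs to its paragraph.
    label = movie.find(string=re.compile(r'Director'))
    if label is None:
        return ''
    paragraph = label.find_parent('p')
    link = paragraph.find('a') if paragraph else None
    return link.text if link else ''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
movie = soup.find('div', {'class': 'lister-item-content'})
print(find_director(movie))
```

Returning an empty string when nothing matches keeps the loop over movies from failing on entries that have no director credit.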

Upvotes: 1

Jonathan Leon

Reputation: 5648

You're not handling your lists correctly. I had to be more specific about the tags and the ways of searching for the data, and I changed findAll to find.

import re

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'
page = requests.get(my_url, headers=headers)
page_soup = BeautifulSoup(page.text,"html.parser")
# Same containers as in the question: one div per movie.
containers = page_soup.find_all("div", {"class": "lister-item mode-advanced"})
for container in containers:
  # find returns a single tag (not a list), so .text can be read directly.
  print(container.find("a", href=re.compile('adv_li_tt')).text)
  print(container.find("strong").text)
  print(container.find("span",{"class":"runtime"}).text)
  print(container.find("span",{"class":"genre"}).text.strip())
  print(container.find('a', href=re.compile('adv_li_dr_0')).text)
  print('\n')

Output

Wrong Turn
5.4
109 min
Horror, Thriller
Mike P. Nelson


Willy's Wonderland
5.7
88 min
Action, Comedy, Horror
Kevin Lewis


Red Dot
5.5
86 min
Drama, Horror, Thriller
Alain Darborg
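
To illustrate the difference between `find_all` and `find` that this answer relies on, here is a minimal sketch on a hypothetical HTML fragment, not the real IMDB markup. It assumes the rating div carries the two classes `inline-block` and `ratings-imdb-rating`, which the dotted string in the question suggests but the thread does not confirm.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the class names on the rating div are an assumption.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="inline-block ratings-imdb-rating"><strong>5.4</strong></div>
  <span class="runtime">109 min</span>
</div>
"""

container = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('div', class_='lister-item')

# find_all returns a list-like ResultSet of tags, even when there is only one match.
runtimes = container.find_all('span', {'class': 'runtime'})
print(runtimes)

# find returns the first matching tag (or None), so .text is available directly.
runtime = container.find('span', {'class': 'runtime'})
print(runtime.text)

# A dotted string is CSS selector syntax, not a class attribute value, so this finds nothing.
missing = container.find('div', {'class': 'inline-block.ratings-imdb-rating'})
print(missing)

# Either match one of the classes with find, or pass the dotted form to select_one.
by_class = container.find('div', class_='ratings-imdb-rating').strong.text
by_selector = container.select_one('div.inline-block.ratings-imdb-rating').strong.text
print(by_class, by_selector)
```

Note also that the print calls in the question sit outside the for loop, so they run once and show only the values from the last container.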

Upvotes: 1
