Reputation: 154
I tried to scrape a certain information from a webpage but failed miserably. The text I wish to grab is available in the page source but I still can't fetch it. This is the site address. I'm after the portion visible in the image as Not Rated
.
Relevant html:
<div class="subtext">
Not Rated
<span class="ghost">|</span> <time datetime="PT188M">
3h 8min
</time>
<span class="ghost">|</span>
<a href="/search/title?genres=drama&explore=title_type,genres&ref_=tt_ov_inf">Drama</a>,
<a href="/search/title?genres=musical&explore=title_type,genres&ref_=tt_ov_inf">Musical</a>,
<a href="/search/title?genres=romance&explore=title_type,genres&ref_=tt_ov_inf">Romance</a>
<span class="ghost">|</span>
<a href="/title/tt0150992/releaseinfo?ref_=tt_ov_inf" title="See more release dates">18 June 1999 (India)
</a> </div>
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
r = s.get(link)
soup = BeautifulSoup(r.text,"lxml")
rating = soup.select_one(".titleBar .subtext").next_element
print(rating)
I get None using the script above.
Expected output:
Not Rated
How can I get the rating from that webpage?
Upvotes: 1
Views: 81
Reputation: 2227
IMDB is one of those webpages that makes it incredible easy to do webscraping and I love it. So what they do to make it easy for webscrapers is to put a script in the top of the html that contains the whole movie object in the format of JSON.
So to get all the relevant information and organize it you simply need to get the content of that single script tag, and convert it to JSON, then you can simply ask for the specific information like with a dictionary.
import requests
import json
from bs4 import BeautifulSoup
#This part is basically the same as yours
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
r = requests.get(link)
soup = BeautifulSoup(r.content,"lxml")
#Why not get the whole json element of the movie?
script = soup.find('script', {"type" : "application/ld+json"})
element = json.loads(script.text)
print(element['contentRating'])
#Outputs "Not Rated"
# You can also inspect te rest of the json it has all the relevant information inside
#Just -> print(json.dumps(element, indent=2))
Note: Headers and session are not necessary in this example.
Upvotes: 1
Reputation: 195468
If you want to get correct version of HTML page, specify Accept-Language
http header:
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
s.headers['Accept-Language'] = 'en-US,en;q=0.5' # <-- specify also this!
r = s.get(link)
soup = BeautifulSoup(r.text,"lxml")
rating = soup.select_one(".titleBar .subtext").next_element
print(rating)
Prints:
Not Rated
Upvotes: 1
Reputation: 337
There is a better way to getting info on the page. If you dump the html content returned by the request.
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
r = s.get(link)
soup = BeautifulSoup(r.text,"lxml")
with open("response.html", "w", encoding=r.encoding) as file:
file.write(r.text)
you will find a element <script type="application/ld+json">
which contains all the information about the movie.
Then, you simply get the element text, parse it as json, and use the json to extract the info you wanted.
here is a working example
import json
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
r = s.get(link)
soup = BeautifulSoup(r.text,"lxml")
movie_data = soup.find("script", attrs={"type": "application/ld+json"}).next # Find the element <script type="application/ld+json"> and get it's content
movie_data = json.loads(movie_data) # parse the data to json
content_rating = movie_data["contentRating"] # get rating
Upvotes: 1