Reputation: 115
I am trying to scrape movie data from IMDb using Python. I can get data about all the important aspects except the actors' names.
Here is a sample URL that I am working on:
https://www.imdb.com/title/tt0106464/
Using the browser's "Inspect" functionality, I found an XPath that matches all the actors' names, but when I run the code in Python, the XPath does not seem to be valid (it returns nothing).
Here is a simple version of the code I am using:
import requests
from lxml import html
movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5
IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)
I tried changing the XPath many times, making it more generic and then more specific, but it still does not return anything.
Upvotes: 0
Views: 216
Reputation: 6050
Looking at the HTML, I would start with a simple XPath such as //td[@class="primary_photo"]:
<table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"
><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BMTI4ODM2MzQwN15BMl5BanBnXkFtZTcwMjY2OTI5MQ@@._V1_UY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td>
PYTHON:
for photo in doc.xpath('//td[@class="primary_photo"]'):
    print(photo)
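As a rough sketch of where that simple XPath can lead: in the HTML quoted above, the image inside each primary_photo cell carries the actor's name in its alt attribute, so the names can be read from there. The SAMPLE string below is a shortened copy of that HTML with a placeholder image URL (an assumption for illustration, not the live page), and the variable names are hypothetical.

```python
from lxml import html

# Shortened copy of the cast table quoted above; the image URL is a
# placeholder and the markup is wrapped in <html><body> for parsing.
SAMPLE = """<html><body><table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="placeholder.png" /></a> </td>
</tr>
</table></body></html>"""

doc = html.fromstring(SAMPLE)

# Step 1: confirm the simple XPath matches the photo cells at all.
photo_cells = doc.xpath('//td[@class="primary_photo"]')

# Step 2: narrow it down to the alt attribute of the image in each cell.
names_from_alt = [str(name) for name in doc.xpath('//td[@class="primary_photo"]//img/@alt')]

print(len(photo_cells))
print(names_from_alt)
```

Whether the live page still uses these class names is not something this snippet can confirm, so it is worth checking the raw source first.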
Upvotes: 0
Reputation: 56
Don't blindly accept the markup structure you see using Inspect Element. Browsers are very lenient and will try to fix any markup issues in the source.
That said, if you check the raw source using View Source, you can see that the table you're trying to scrape has no <tbody>; those elements are inserted by the browser.
So if you remove //tbody from your query:
//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()
-> //table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()
your query should work.
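A minimal sketch of the difference, run against a small inline table rather than the live page. SAMPLE is a hypothetical, trimmed-down stand-in for the raw source (no <tbody>, as in View Source), and the name cell is an assumption about what the second column looks like.

```python
from lxml import html

# Hypothetical stand-in for the raw page source: note there is no <tbody>.
SAMPLE = """<html><body><table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo"><a href="/name/nm0000418/"><img alt="Danny Glover" /></a></td>
<td><a href="/name/nm0000418/"> Danny Glover
</a></td>
</tr>
</table></body></html>"""

doc = html.fromstring(SAMPLE)

# Original query: requires a <tbody> that only exists in the browser's DOM.
with_tbody = doc.xpath(
    '//table[@class="cast_list"]//tbody//tr'
    '//td[not(contains(@class,"primary_photo"))]//a/text()'
)

# Corrected query: same path with the //tbody step removed.
without_tbody = doc.xpath(
    '//table[@class="cast_list"]//tr'
    '//td[not(contains(@class,"primary_photo"))]//a/text()'
)

# Link text usually carries surrounding whitespace, so strip it.
names = [text.strip() for text in without_tbody if text.strip()]

print(with_tbody)
print(names)
```

The first query comes back empty because lxml parses the source as written and does not add the <tbody> that a browser would.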
Upvotes: 1