Python - XPath issue while scraping the IMDb Website

Question

I am trying to scrape movie data from IMDb using Python. I can get all the important fields except the actors' names.

Here is a sample URL that I am working on:

https://www.imdb.com/title/tt0106464/

Using the browser's "Inspect" tool, I found the XPath that matches all the actors' names, but when I run the code in Python the XPath does not seem to be valid (it does not return anything).

Here is a simple version of the code I am using:

import requests
from lxml import html

movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5

IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)

I tried changing the XPath many times, making it more generic and then more specific, but it still does not return anything.

m9mhmdy · Accepted Answer

Don't blindly trust the markup structure you see in Inspect Element.
Browsers are very lenient and will try to fix any markup issues in the source.
That said, if you check the raw source using View Source, you can see that the table you're trying to scrape has no <tbody> elements; they are inserted by the browser.
So if you remove tbody from the query, changing

//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()

to

//table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()

your query should work.
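
To see the difference without hitting the network, here is a minimal sketch that runs both queries against a small hand-written table. The markup in SAMPLE_SOURCE is hypothetical: it only imitates the shape of a cast table as it would appear in the raw source (no tbody), and the actor names and hrefs are placeholders, not real IMDb data.

```python
from lxml import html

# Hypothetical markup imitating the raw page source: the table has no <tbody>.
# Names and hrefs are placeholders, not real IMDb data.
SAMPLE_SOURCE = """
<html><body>
<table class="cast_list">
  <tr>
    <td class="primary_photo"><a href="#actor-one">photo</a></td>
    <td><a href="#actor-one">Actor One</a></td>
  </tr>
  <tr>
    <td class="primary_photo"><a href="#actor-two">photo</a></td>
    <td><a href="#actor-two">Actor Two</a></td>
  </tr>
</table>
</body></html>
"""

doc = html.fromstring(SAMPLE_SOURCE)

# Query copied from the browser inspector: requires a <tbody> that the
# raw source does not contain, so nothing matches.
with_tbody = doc.xpath(
    '//table[@class="cast_list"]//tbody//tr'
    '//td[not(contains(@class,"primary_photo"))]//a/text()'
)

# Same query with the tbody step removed.
without_tbody = doc.xpath(
    '//table[@class="cast_list"]//tr'
    '//td[not(contains(@class,"primary_photo"))]//a/text()'
)

# Strip surrounding whitespace and drop empty text nodes.
actors = [name.strip() for name in without_tbody if name.strip()]

print(with_tbody)
print(actors)
```

The first query returns an empty list because lxml parses the markup as served and does not add the tbody element the way a browser does. The real page may differ from this sample, so treat the second query as a starting point to check against View Source.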
