Stefano Leone

Reputation: 115

Python - XPath issue while scraping the IMDb Website

I am trying to scrape movie pages on IMDb using Python. I can get data for all the important fields except the actors' names.

Here is a sample URL that I am working on:

https://www.imdb.com/title/tt0106464/

Using the browser's "Inspect" feature, I found an XPath that matches all the actors' names, but when I run the code in Python, the XPath does not seem to be valid (it returns nothing).

Here is a simple version of the code I am using:

import requests
from lxml import html

movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5

IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)

I have changed the XPath many times, making it more generic and then more specific, but it still returns nothing.

Upvotes: 0

Views: 216

Answers (2)

shrewmouse

Reputation: 6050

Looking at the HTML, start with a simple XPath such as //td[@class="primary_photo"] and build up from there:

<table class="cast_list">    
  <tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
      <tr class="odd">
          <td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"
><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BMTI4ODM2MzQwN15BMl5BanBnXkFtZTcwMjY2OTI5MQ@@._V1_UY44_CR1,0,32,44_AL_.jpg" /></a>          </td>
          <td>

PYTHON:

for photo in doc.xpath('//td[@class="primary_photo"]'):
    print(photo)
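
As a rough sketch of where that simple XPath can lead: in the markup quoted above, the actor's name also appears in the alt attribute of the image inside each primary_photo cell, so it may be readable from there. The SAMPLE string below is a hypothetical, trimmed copy of that markup (not fetched from IMDb), and the live page may differ.

```python
from lxml import html

# Hypothetical snippet modelled on the markup quoted above; not fetched from IMDb.
SAMPLE = """
<table class="cast_list">
  <tr class="odd">
    <td class="primary_photo">
      <a href="/name/nm0000418/?ref_=tt_cl_i1"><img height="44" width="32" alt="Danny Glover" title="Danny Glover" /></a>
    </td>
  </tr>
</table>
"""

doc = html.fromstring(SAMPLE)

# Step 1: confirm the simple XPath matches the photo cells at all.
photo_cells = doc.xpath('//td[@class="primary_photo"]')
print(len(photo_cells))

# Step 2: extend it to read the alt attribute of the image in each cell.
names = doc.xpath('//td[@class="primary_photo"]//img/@alt')
print(names)
```

Starting from a query that is known to match and extending it one step at a time makes it easier to see which step stops returning results.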

Upvotes: 0

m9mhmdy

Reputation: 56

Don't blindly trust the markup structure you see in "Inspect element".
Browsers are very lenient and will try to fix any markup issues in the source.
If you check the raw source using "View source", you can see that the table you're trying to scrape has no <tbody>; the browser inserts it when it builds the DOM.
So if you remove it from the query:
//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text() -> //table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()
your query should work.
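
To illustrate the difference without hitting the network, here is a minimal sketch that runs both queries against an inline table. The SAMPLE markup is a hypothetical, trimmed version of the cast table quoted in this thread, written without <tbody> as in the raw source; lxml parses it as written and does not add the element the way a browser does.

```python
from lxml import html

# Hypothetical, trimmed-down markup modelled on the cast table quoted in this
# thread. There is no <tbody>, just as in the raw page source.
SAMPLE = """
<table class="cast_list">
  <tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
  <tr class="odd">
    <td class="primary_photo"><a href="/name/nm0000418/"><img alt="Danny Glover"/></a></td>
    <td><a href="/name/nm0000418/"> Danny Glover
    </a></td>
  </tr>
</table>
"""

doc = html.fromstring(SAMPLE)

# Original query: requires a <tbody> that the parsed tree does not contain.
with_tbody = doc.xpath(
    '//table[@class="cast_list"]//tbody//tr'
    '//td[not(contains(@class,"primary_photo"))]//a/text()'
)

# Corrected query: the same path with the //tbody step removed.
without_tbody = doc.xpath(
    '//table[@class="cast_list"]//tr'
    '//td[not(contains(@class,"primary_photo"))]//a/text()'
)

# Link text carries surrounding whitespace, so strip it and drop empty strings.
actors = [name.strip() for name in without_tbody if name.strip()]

print(with_tbody)  # empty list: nothing matches through //tbody
print(actors)
```

The same corrected XPath can be dropped into the question's code in place of the original; whether it still matches depends on the markup IMDb serves at the time.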

Upvotes: 1
