Afiz

Reputation: 95

Empty list returned when scraping IMDB with Python lxml

I am trying to fetch the top 250 movies list from IMDB using lxml, but it returns an empty list. Can anyone tell me what mistake I have made?

from lxml.html import parse
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[2]//a')

The movies list is empty: []

Upvotes: 1

Views: 481

Answers (2)

har07

Reputation: 89325

Your XPath doesn't match any element in the linked page. I tested it in the browser using FirePath, and it returned "no matching nodes".
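One common reason an expression like `.//table[2]//a` matches nothing is that `table[2]` means "a table that is the second table child of its own parent", not "the second table in the document". Whether that is what happens on the IMDB page was not checked here; the sketch below only shows the difference on made-up markup (the `SAMPLE` string and its links are hypothetical):

```python
from lxml.html import fromstring

# Hypothetical markup: two tables, each the only table inside its own div.
SAMPLE = """
<html><body>
  <div><table><tr><td><a href="/a">first</a></td></tr></table></div>
  <div><table><tr><td><a href="/b">second</a></td></tr></table></div>
</body></html>
"""

doc = fromstring(SAMPLE)

# table[2]: a <table> that is the second <table> child of its parent.
# No parent here has two table children, so nothing matches.
per_parent = doc.xpath(".//table[2]//a")

# (//table)[2]: the second <table> in document order.
in_document = doc.xpath("(//table)[2]//a/text()")

print(per_parent)
print(in_document)
```

If the tables on the real page are not siblings, the parenthesised form is the one that selects "the second table on the page".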

This is one way that worked for me:

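from lxml.html import parse

# Download and parse the chart page
tree = parse('http://www.imdb.com/chart/top')
# Select the link text inside every title cell of the chart table
movies = tree.xpath("//table[contains(@class, 'chart')]//td[@class='titleColumn']/a/text()")
for movie in movies:
    print(movie)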

It is better to use the xpath() method, which provides full support for XPath 1.0 expressions. A brief explanation of the XPath expression used above follows:

  • //table[contains(@class, 'chart')] : find a table element, anywhere in the HTML document, whose class attribute contains the text "chart"
  • //td[@class='titleColumn'] : then find td elements, anywhere within that table, whose class attribute value equals "titleColumn"
  • /a/text() : then, from each such td, find the child a element and return its text content
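The three steps above can be tried one at a time on a small inline document. This is only a sketch: the `SAMPLE` markup, the extra `full-width` class and the `posterColumn` cell are made up for illustration and are not copied from the IMDB page.

```python
from lxml.html import fromstring

# Hypothetical markup that mimics the structure the XPath expects.
SAMPLE = """
<html><body>
<table class="chart full-width">
  <tr>
    <td class="posterColumn"><a href="/title/1/">poster</a></td>
    <td class="titleColumn">1. <a href="/title/1/">The Shawshank Redemption</a></td>
  </tr>
  <tr>
    <td class="titleColumn">2. <a href="/title/2/">The Godfather</a></td>
  </tr>
</table>
</body></html>
"""

doc = fromstring(SAMPLE)

# Step 1: tables whose class attribute contains "chart".
tables = doc.xpath("//table[contains(@class, 'chart')]")

# Step 2: td cells inside that table whose class is exactly "titleColumn".
cells = tables[0].xpath(".//td[@class='titleColumn']")

# Step 3: text of the <a> children of each cell.
titles = [text for cell in cells for text in cell.xpath("a/text()")]

# The same result with the single combined expression.
combined = doc.xpath(
    "//table[contains(@class, 'chart')]//td[@class='titleColumn']/a/text()"
)

print(titles)
```

Note that the link in the `posterColumn` cell is skipped, because step 2 only keeps cells whose class equals "titleColumn".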

Part of the output of the above snippet:

The Shawshank Redemption
The Godfather
The Godfather: Part II
The Dark Knight
Pulp Fiction
.....

Upvotes: 1

DeltaWeb

Reputation: 381

My guess is that you're using the wrong XPath. Using Firebug, the correct XPath for the movies table is:

/html/body/div[1]/div/div[4]/div[3]/div/div[1]/div/span/div/div/div[2]/table/tbody

This will return a table which contains all the movies data.

You will need some more processing to get each movie's info.
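As a rough idea of what that extra processing could look like, the sketch below walks the rows of a table element and builds one record per movie. The helper name `rows_to_movies`, the `SAMPLE` markup and the `titleColumn` class are assumptions for illustration. One caveat: browsers add `tbody` to the DOM even when the HTML source has none, so an absolute path copied from Firebug may not match what lxml parses; the sketch therefore searches for rows with `.//tr` instead.

```python
from lxml.html import fromstring

# Hypothetical markup; the real page may be structured differently.
SAMPLE = """
<html><body>
<table class="chart">
  <tr><th>Rank &amp; Title</th></tr>
  <tr><td class="titleColumn">1. <a href="/title/1/">The Dark Knight</a></td></tr>
  <tr><td class="titleColumn">2. <a href="/title/2/">Pulp Fiction</a></td></tr>
</table>
</body></html>
"""


def rows_to_movies(table):
    """Return one dict per table row that contains a title link."""
    movies = []
    for row in table.xpath(".//tr"):
        links = row.xpath("./td[@class='titleColumn']/a")
        if not links:
            continue  # header rows or rows without a title link
        link = links[0]
        movies.append({
            "title": link.text_content().strip(),
            "href": link.get("href"),
        })
    return movies


table = fromstring(SAMPLE).xpath("//table")[0]
movies = rows_to_movies(table)
for movie in movies:
    print(movie["title"], movie["href"])
```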

I'd also suggest using the requests library for HTTP queries.
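A minimal sketch of how requests and lxml might be combined, keeping the download separate from the parsing so the parsing can be tried offline. The function names `fetch_html` and `extract_titles`, the timeout value and the optional headers argument are assumptions, and the `titleColumn` class is taken from the other answer, so it may not match the live page.

```python
from lxml.html import fromstring


def fetch_html(url, headers=None, timeout=10):
    """Download a page and return its raw bytes (needs the requests package)."""
    import requests  # imported here so extract_titles works without requests

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.content


def extract_titles(html):
    """Return the link text of every td with class 'titleColumn'."""
    doc = fromstring(html)
    return [t.strip() for t in doc.xpath("//td[@class='titleColumn']/a/text()")]


# Offline check on hypothetical markup.
SAMPLE = """
<html><body><table class="chart">
  <tr><td class="titleColumn">1. <a href="/title/1/">Pulp Fiction</a></td></tr>
</table></body></html>
"""
sample_titles = extract_titles(SAMPLE)
print(sample_titles)

# Against the live site (not run here):
# titles = extract_titles(fetch_html("http://www.imdb.com/chart/top"))
```

Passing the raw bytes to lxml lets the parser pick up the encoding declared in the page. Some sites reject requests that lack a browser-like User-Agent, which is why the headers argument is exposed.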

Upvotes: 0
