Reputation: 95
I am trying to fetch the Top 250 movies list from IMDb using lxml, but it returns an empty list. Can anyone tell me what mistake I have made?
from lxml.html import parse
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[2]//a')
The movies list is empty: []
Upvotes: 1
Views: 481
Reputation: 89325
Your XPath doesn't correspond to any element in the linked page. I tested it in the browser using FirePath, and "no matching nodes" was returned.
This is one way that worked for me:
from lxml.html import parse
tree = parse('http://www.imdb.com/chart/top')
movies = tree.xpath("//table[contains(@class, 'chart')]//td[@class='titleColumn']/a/text()")
for movie in movies:
    print(movie)
It is better to use the xpath() method, which provides full support for XPath 1.0 expressions. A brief explanation of the XPath expression used above:
//table[contains(@class, 'chart')] : find a table element, anywhere in the HTML document, whose class attribute contains the text "chart"
//td[@class='titleColumn'] : then find td elements, anywhere within that table, whose class attribute value equals "titleColumn"
/a/text() : then, from each such td, find the child element a and return its text content

Part of the output of the above snippet:
The Shawshank Redemption
The Godfather
The Godfather: Part II
The Dark Knight
Pulp Fiction
.....
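To see how the three parts of the expression narrow down the match without hitting the network, here is a minimal sketch that runs the same XPath against a small hand-written HTML string. The SAMPLE markup is hypothetical: it only mimics the class names used above, and the live IMDb page may be structured differently.

```python
from lxml import html

# Hypothetical markup that imitates the class names from the answer;
# it is NOT a copy of the real IMDb page.
SAMPLE = """
<html><body>
  <table class="nav"><tr><td><a href="/home">Home</a></td></tr></table>
  <table class="chart full-width">
    <tr><td class="titleColumn"><a href="/title/1">The Shawshank Redemption</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/2">The Godfather</a></td></tr>
    <tr><td class="ratingColumn"><a href="/rate">Rate</a></td></tr>
  </table>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# Same expression as above: a table whose class contains "chart", then td
# cells with class exactly "titleColumn", then the text of their child links.
titles = doc.xpath(
    "//table[contains(@class, 'chart')]//td[@class='titleColumn']/a/text()"
)

for title in titles:
    print(title)
```

The link in the "nav" table and the one in the "ratingColumn" cell are skipped, because neither satisfies both predicates.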
Upvotes: 1
Reputation: 381
My guess is that you're using the wrong XPath. According to Firebug, the correct XPath for the movies table is
/html/body/div[1]/div/div[4]/div[3]/div/div[1]/div/span/div/div/div[2]/table/tbody
This will return a table which contains all the movie data.
You will need some more processing to get each movie's info.
I'd also suggest using the requests library for HTTP queries.
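A rough sketch of what combining requests with lxml could look like is below. The helper names parse_titles and fetch_titles are made up for illustration, and the XPath is the class-based one from the other answer, so it assumes the page still uses those class names.

```python
import requests
from lxml import html

# Assumption: the page marks up titles with these class names.
TITLE_XPATH = "//table[contains(@class, 'chart')]//td[@class='titleColumn']/a/text()"


def parse_titles(page_source):
    """Return the stripped link texts matched by TITLE_XPATH in an HTML string or bytes."""
    doc = html.fromstring(page_source)
    return [text.strip() for text in doc.xpath(TITLE_XPATH)]


def fetch_titles(url):
    """Download a page with requests and extract the titles from it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return parse_titles(response.content)
```

Calling fetch_titles('http://www.imdb.com/chart/top') would then return a list of titles, or an empty list if the markup does not match. Keeping the download separate from the parsing makes it easy to check the XPath against a saved copy of the page.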
Upvotes: 0