Sean

Reputation: 545

Scraping all links and link content with Scrapy

I am trying to scrape every internal link from IMDB and then scrape the title from each link's page. However, when I run the code below, nothing is returned.

import scrapy
from urllib.parse import urljoin
from FirstSpider.items import MovieItem

class ProductsSpider(scrapy.Spider):

    name = "movies"
    allowed_domains = ["www.imdb.com"]
    start_urls = ('https://www.imdb.com/chart/top',)

    def parse(self, response):
        products = response.xpath('//body/a/@href').extract()
        for p in products:
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_movie)

    def parse_movie(self, response):
        item = MovieItem()
        item['title'] = response.xpath('//title/text()').extract()
        return item

I suspect I am missing a line of code in the parse_movie method, but I have spent all day going in circles and am feeling a bit hopeless. I apologize if this is an easy fix, as I am all too new to Scrapy and Python. Thank you.

Upvotes: 1

Views: 3330

Answers (1)

Miguel Garcia

Reputation: 356

You should use //body//a/@href instead of //body/a/@href to get all the links: a single / matches only direct children of the body element, while // matches descendants at any depth, and the links on that page are nested inside tables. Since you probably only want the movie links (the page contains many other links), narrow it further to '//body//td[@class="titleColumn"]/a/@href'.
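To see the difference between the two XPath forms, here is a small sketch using the standard library's xml.etree.ElementTree instead of Scrapy's selectors, run against a made-up HTML snippet that stands in for the real chart page (the titles and hrefs are illustrative, not scraped):

```python
# Sketch: why '//body/a' finds almost nothing on a page like the Top 250
# chart. The movie links are nested inside table cells, so only the
# descendant axis ('//') reaches them. Uses xml.etree.ElementTree, not
# Scrapy; the HTML below is a hypothetical stand-in for the IMDB page.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <table>
    <tr><td class="titleColumn"><a href="/title/tt0111161/">The Shawshank Redemption</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/tt0068646/">The Godfather</a></td></tr>
  </table>
  <a href="/help">Help</a>
</body></html>
"""

body = ET.fromstring(html).find('body')

direct = body.findall('./a')                            # like //body/a  -> direct children only
nested = body.findall('.//a')                           # like //body//a -> any depth
movies = body.findall(".//td[@class='titleColumn']/a")  # movie links only

print(len(direct), len(nested), len(movies))  # 1 3 2
```

The direct-child query sees only the one stray link that sits immediately under body, while the descendant queries reach the links buried in the table, which is why the original spider yielded no movie requests.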

I made an IMDB scraper; take a look at it if you wish: https://github.com/miguelgarcia/imdb_scraping

Upvotes: 1
