I have a CSV file containing the IMDb movie IDs of 300 movies. The IMDb URL for each movie has the format: https://www.imdb.com/title/ttmovieID
I want to scrape each movie's page for the thumbnail image link, title, actors, and year of release, and write the results to a CSV file with one row per movie.
Since I have the movie ID for each movie in a CSV file, what should the start_urls of my spider be, and how should I structure my parse function? Also, how do I write the results to a CSV file?
I have the following approach working for IMDb's Top 250 chart page. What changes should I make to start_urls and links?
import scrapy
import csv
from example.items import MovieItem

class ImdbSpider(scrapy.Spider):
    name = "imdbtestspider"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    def parse(self, response):
        links = response.xpath('//tbody[@class="lister-list"]/tr/td[@class="titleColumn"]/a/@href').extract()
        i = 1
        for link in links:
            abs_url = response.urljoin(link)
            url_next = '//*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr[' + str(i) + ']/td[3]/strong/text()'
            rating = response.xpath(url_next).extract()
            if i <= len(links):
                i = i + 1
            yield scrapy.Request(abs_url, callback=self.parse_indetail, meta={'rating': rating})

    def parse_indetail(self, response):
        item = MovieItem()
        item['title'] = response.xpath('//div[@class="title_wrapper"]/h1/text()').extract()[0][:-1]
        item['director'] = response.xpath('//div[@class="credit_summary_item"]/span[@itemprop="director"]/a/span/text()').extract()
        return item
You could just read your .csv file in the spider's start_requests method and yield the requests from there. The code could look something like this:
import csv
from scrapy import Request
...
def start_requests(self):
    with open('imdb_ids.csv') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        next(reader, None)  # skip the header row
        for row in reader:
            # row is a list of fields; row[0] holds the numeric movie ID,
            # which gets the "tt" prefix to form the title URL
            yield Request('https://www.imdb.com/title/tt' + row[0])
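To cover the CSV-output part of the question: the simplest route is to yield items from your parse callback and let Scrapy's built-in feed exporter write the file (scrapy crawl imdbtestspider -o movies.csv). If you want to see the two CSV steps in isolation, here is a minimal sketch using only the standard csv module; the file names imdb_ids.csv and movies.csv, the header row in the ID file, and the output column names are assumptions, not something fixed by Scrapy:

```python
import csv

def build_urls(id_file):
    """Read numeric movie IDs from a CSV file (assumed to have a header
    row) and build the corresponding IMDb title URLs."""
    with open(id_file, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return ['https://www.imdb.com/title/tt' + row[0] for row in reader if row]

def write_movies(rows, out_file):
    """Write one row per movie with the fields the question asks for:
    thumbnail link, title, actors, and year of release."""
    fields = ['thumbnail', 'title', 'actors', 'year']
    with open(out_file, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

In a spider you would do the same work inside start_requests (building URLs) and inside an item pipeline or the feed exporter (writing rows), but the CSV mechanics are identical.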