Scrapy iterate through starting urls and domains

Question

I am trying to read a list of URLs and domains from a CSV file and have a Scrapy spider iterate through the domains and starting URLs. The goal is to export every URL found within each domain to a CSV file through my pipeline.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from NONPROF.items import NonprofItem
from scrapy.http import Request
import pandas as pd


file_path = 'C:/csv'
open_list = pd.read_csv(file_path)
urlorgs = open_list.http.tolist()

open_list2 = pd.read_csv(file_path)
domainorgs = open_list2.domain.tolist()



class Nonprof(CrawlSpider):
        name = "responselist"
    for domain in domainorgs:
        allowed_domains = [domain]
    for url in urlorgs:
        start_urls = [url]

        rules = [
            Rule(LinkExtractor(
                allow=['.*']),
                 callback='parse_item',
                 follow=True)
            ]

        def parse_item (self, response):
            item = NonprofItem()
            item['responseurl'] = response.url
            yield item

When I run the spider, it either gives me an indentation error or, when I adjust the indentation, it only recognizes the last domain in the list.

Any recommendations on how to accomplish this are appreciated.

kszl · Accepted Answer

The code you pasted has badly broken indentation, so I am not surprised the interpreter complains. But most likely this is your problem:

allowed_domains = [domain]

It creates a new list containing just one domain and assigns it to allowed_domains. So the last domain overrides everything that was saved there before. Fix it by doing:

allowed_domains = []
for domain in domainorgs:
    allowed_domains += [domain]

or even like this (without the loop):

allowed_domains = domainorgs
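
To see the difference in isolation, here is a minimal sketch that runs without Scrapy. The sample rows, the `load_targets` helper, and the two class names are made up for illustration; only the `http` and `domain` column names come from the question. Note that the second loop in the question rebinds `start_urls` in exactly the same way.

```python
import csv
import io

# Hypothetical stand-in for the CSV in the question. The column names
# `http` and `domain` come from the question; the rows are invented.
SAMPLE_CSV = """http,domain
http://example.org/,example.org
http://example.com/,example.com
http://example.net/,example.net
"""


def load_targets(handle):
    """Read the file once and return (start_urls, allowed_domains)."""
    start_urls, allowed_domains = [], []
    for row in csv.DictReader(handle):
        start_urls.append(row["http"])
        allowed_domains.append(row["domain"])
    return start_urls, allowed_domains


urlorgs, domainorgs = load_targets(io.StringIO(SAMPLE_CSV))


class Overwritten:
    # Pattern from the question: every pass rebinds the name to a new
    # one-item list, so only the last domain is left after the loop.
    for domain in domainorgs:
        allowed_domains = [domain]


class Accumulated:
    # Pattern from the answer: assign the full lists directly.
    allowed_domains = domainorgs
    start_urls = urlorgs


print(Overwritten.allowed_domains)   # ['example.net']
print(Accumulated.allowed_domains)   # ['example.org', 'example.com', 'example.net']
print(Accumulated.start_urls)
```

In the actual spider, the same two lists would be assigned to `allowed_domains` and `start_urls` in the `CrawlSpider` class body, in place of the two loops.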
