Anthony

Reputation: 111

Scrapy iterate through starting urls and domains

I am attempting to read a list of URLs and domains from a CSV file and have a Scrapy spider iterate through the domains and starting URLs, with the goal of exporting every URL within each domain to a CSV file through my pipeline.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from NONPROF.items import NonprofItem
from scrapy.http import Request
import pandas as pd


file_path = 'C:/csv'
open_list = pd.read_csv(file_path)
urlorgs = open_list.http.tolist()

open_list2 = pd.read_csv(file_path)
domainorgs = open_list2.domain.tolist()



class Nonprof(CrawlSpider):
        name = "responselist"
    for domain in domainorgs:
        allowed_domains = [domain]
    for url in urlorgs:
        start_urls = [url]

        rules = [
            Rule(LinkExtractor(
                allow=['.*']),
                 callback='parse_item',
                 follow=True)
            ]

        def parse_item (self, response):
            item = NonprofItem()
            item['responseurl'] = response.url
            yield item

When I run the spider, it either gives me an indentation error or, when I adjust the indentation, it only recognizes the last domain in the list.

Any recommendations on how to accomplish this are appreciated.

Upvotes: 0

Views: 734

Answers (2)

gangabass

Reputation: 10666

Fix your indentation and try this:

allowed_domains = []
start_urls = []
for domain in domainorgs:
    allowed_domains.append(domain)
for url in urlorgs:
    start_urls.append(url)
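
Both lists can also be filled in a single pass over the CSV. The following is only a rough sketch using the standard library `csv` module: the column names `http` and `domain` come from the question's code, while the in-memory sample rows are made up for illustration and stand in for the real file.

```python
import csv
import io

# Made-up sample standing in for the CSV file from the question;
# the column names `http` and `domain` match the question's code.
sample_csv = io.StringIO(
    "http,domain\n"
    "http://example.org/,example.org\n"
    "http://example.com/start,example.com\n"
)

start_urls = []
allowed_domains = []
for row in csv.DictReader(sample_csv):
    # Append to the existing lists instead of rebinding them.
    start_urls.append(row["http"])
    allowed_domains.append(row["domain"])

print(start_urls)
print(allowed_domains)
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle.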

Upvotes: 0

kszl

Reputation: 1213

The code you pasted has terrible indentation, so I am not surprised the interpreter complains. But most likely this is your problem:

allowed_domains = [domain]

On every iteration, this creates a new list containing just one domain and assigns it to allowed_domains, so the last domain overwrites everything that was stored there before. Fix it by doing:

allowed_domains = []
for domain in domainorgs:
    allowed_domains += [domain]

or even like this (without the loop):

allowed_domains = domainorgs
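
To see the difference outside of Scrapy, here is a minimal sketch with plain Python classes. The class names and the sample domains are hypothetical, chosen only to illustrate how a loop in a class body behaves; none of this is Scrapy API.

```python
# Hypothetical stand-ins for the list read from the CSV in the question.
domainorgs = ["example.org", "example.com", "example.net"]


class Rebinding:
    # Each pass rebinds the name to a brand-new one-element list,
    # so only the last domain survives once the loop finishes.
    for domain in domainorgs:
        allowed_domains = [domain]


class Accumulating:
    # Start with an empty list and extend it on every pass.
    allowed_domains = []
    for domain in domainorgs:
        allowed_domains += [domain]


class Direct:
    # No loop needed: copy the whole list in one assignment.
    allowed_domains = list(domainorgs)


print(Rebinding.allowed_domains)
print(Accumulating.allowed_domains)
print(Direct.allowed_domains)
```

The first class ends up with only the last domain, which matches the behavior described in the question; the other two keep all of them. The same reasoning applies to start_urls.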

Upvotes: 2
