Reputation: 111
I am attempting to read a list of URLs and domains from a CSV and have a Scrapy
spider iterate through the list of domains and starting URLs, with the goal of exporting every URL found within each domain to a CSV file through my pipeline.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from NONPROF.items import NonprofItem
from scrapy.http import Request
import pandas as pd

file_path = 'C:/csv'
open_list = pd.read_csv(file_path)
urlorgs = open_list.http.tolist()

open_list2 = pd.read_csv(file_path)
domainorgs = open_list2.domain.tolist()

class Nonprof(CrawlSpider):
    name = "responselist"
    for domain in domainorgs:
        allowed_domains = [domain]
    for url in urlorgs:
        start_urls = [url]

    rules = [
        Rule(LinkExtractor(allow=['.*']),
             callback='parse_item',
             follow=True)
    ]

    def parse_item(self, response):
        item = NonprofItem()
        item['responseurl'] = response.url
        yield item
When I run the spider it either gives me an indentation error, or, when I make adjustments to the indentation, it only recognizes the last domain in the list.
Any recommendations on how to accomplish this are appreciated.
Upvotes: 0
Views: 734
Reputation: 10666
Fix your indentation and try this:
    for domain in domainorgs:
        allowed_domains.append(domain)
    for url in urlorgs:
        start_urls.append(url)
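A minimal sketch of that approach, assuming the CSV has `domain` and `http` columns as in the question (read here with the standard-library `csv` module and an inline string as stand-in data, so it runs without pandas or the original file):

```python
import csv
import io

# Stand-in for the CSV file from the question (columns: domain, http).
csv_text = "domain,http\nexample.org,http://example.org\nexample.com,http://example.com\n"

allowed_domains = []
start_urls = []
for row in csv.DictReader(io.StringIO(csv_text)):
    allowed_domains.append(row["domain"])
    start_urls.append(row["http"])

print(allowed_domains)  # ['example.org', 'example.com']
print(start_urls)       # ['http://example.org', 'http://example.com']
```

The key point is that both lists are initialized once and then appended to, so every row survives the loop.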
Upvotes: 0
Reputation: 1213
The code you pasted has badly broken indentation, so I am not surprised the interpreter complains. But most likely this is your problem:
    allowed_domains = [domain]
It creates a new list containing just one domain and assigns it to allowed_domains. So the last domain overrides everything that was saved there before. Fix it by doing:
    allowed_domains = []
    for domain in domainorgs:
        allowed_domains += [domain]
or even like this (without the loop):
    allowed_domains = domainorgs
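The overwrite this answer describes is easy to reproduce with plain lists, no Scrapy needed (the sample domains below are stand-ins for the CSV column):

```python
domainorgs = ["example.org", "example.com", "example.net"]  # stand-in data

# Reassigning inside the loop: each iteration replaces the whole list,
# so only the last domain survives.
for domain in domainorgs:
    allowed_domains = [domain]
print(allowed_domains)  # ['example.net']

# Accumulating instead keeps every domain.
allowed_domains = []
for domain in domainorgs:
    allowed_domains += [domain]
print(allowed_domains)  # ['example.org', 'example.com', 'example.net']
```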
Upvotes: 2