Axel Ericsson

Reputation: 61

URL generator for Scrapy start URLs only reads the first URL, why?

I am using Scrapy as my web-scraping framework and am scraping a number of different domains for a set of companies. I wrote a URL-generator class that reads a file of companies and generates a start URL for each company on different web pages (only one example site is shown). The scraper runs fine for the first record, but it never visits the other URLs. I have tested the URL generator on its own and it returns all the URLs, but for some reason start_urls = [start_url.company-site()] does not work. Any ideas?

URL generator file:

# -*- coding: utf-8 -*-
import os 
import os.path

class URL(object):
    P=[]

    def read(self, filename):
        with open(filename) as f:
            for line in f:
                field = line.split(',')
                company = field[1].replace(" ", '+')
                adress="{0}+{1}".format(field[5],field[11])
                self.P.append("http://www.companywebpage.com/market-search?q={0}".format(company))

    def company-site(self):
        for i in self.P:
            return i

Spider file:

root = os.getcwd()
start_url = URL()
p = os.path.join(root, 'Company_Lists', 'Test_of_company.csv')
start_url.read(p)

class company-spider(BaseSpider):
    name = "Company-page"
    allowed_domains = ["CompanyDomain.se"]
    start_urls = [start_url.company-site()]

Upvotes: 1

Views: 1552

Answers (1)

warvariuc

Reputation: 59604

Replace

def company-site(self):
    for i in self.P:
        return i

with

def urls(self):
    for url in self.P:
        yield url
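To see why the original method only ever produced one URL, here is a minimal plain-Python comparison (the urls list is invented sample data, not from the question): a return inside a loop exits on the first iteration, while yield turns the method into a generator that produces every item.

```python
urls = ["http://example.com/a", "http://example.com/b"]

def first_only():
    # `return` exits the function on the very first iteration,
    # so only the first URL is ever produced.
    for u in urls:
        return u

def all_urls():
    # `yield` makes this a generator: each iteration produces
    # the next URL, so nothing is lost.
    for u in urls:
        yield u

print(first_only())      # http://example.com/a
print(list(all_urls()))  # ['http://example.com/a', 'http://example.com/b']
```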

Replace

start_urls = [start_url.company-site()]

with

start_urls = start_url.urls()

or

start_urls = start_url.P

Because Spider.start_requests looks like this:

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
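Since start_requests simply iterates over start_urls, any iterable works. A Scrapy-free sketch of the fixed URL class (with made-up CSV lines, and P moved to an instance attribute so instances don't share state) shows that the generator form and the plain list expose the same URLs:

```python
class URL(object):
    def __init__(self):
        # Instance attribute instead of the original class-level P=[],
        # so separate instances don't share one list.
        self.P = []

    def read_lines(self, lines):
        # Same parsing as the question's read(), minus the file I/O:
        # field[1] is the company name, spaces become '+'.
        for line in lines:
            field = line.split(',')
            company = field[1].replace(" ", '+')
            self.P.append(
                "http://www.companywebpage.com/market-search?q={0}".format(company))

    def urls(self):
        # Generator: produces every stored URL.
        for url in self.P:
            yield url

start_url = URL()
start_url.read_lines(["1,Acme Corp", "2,Beta AB"])  # invented sample rows

# Both forms give start_requests the full set of URLs to iterate over:
as_generator = list(start_url.urls())
as_list = start_url.P
print(as_generator == as_list)  # True
```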

Upvotes: 1
