Reputation: 53
I'm trying to build a crawler that will crawl a list of sites by following all the links on their first page, then repeating this for the new pages. I think I might be using the rules attribute incorrectly: the spider never calls the processor method, no links appear to be followed, and there are no error messages. I've omitted some of the functions so the snippet only shows the changes I made to add crawling. I'm using Scrapy 1.5.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Scraper(CrawlSpider):
    name = "emails"
    lx = LinkExtractor()
    rules = [Rule(link_extractor=lx, follow=True, process_links='processor',
                  callback='landed')]

    def start_requests(self):
        self.inf = DataInterface()
        df = self.inf.searchData()
        row = df.iloc[2]
        print(row)
        # url = 'http://' + row['Website'].lower()
        # self.rules.append()
        url = 'http://example.com/Page.php?ID=7'
        req = scrapy.http.Request(url=url, callback=self.landed,
                                  meta={'index': 1, 'depth': 0,
                                        'firstName': row['First Name'],
                                        'lastName': row['Last Name'],
                                        'found': {}, 'title': row['Title']})
        yield req
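For reference, the two hooks named in the rule would have signatures along these lines; the bodies below are only placeholders standing in for the functions I omitted:

    def processor(self, links):
        # process_links hook: receives the list of links the LinkExtractor
        # found and must return the (possibly filtered) list.
        print('processor called with %d links' % len(links))
        return links

    def landed(self, response):
        # Rule callback: receives the response for each followed page.
        print('landed on %s' % response.url)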
Upvotes: 0
Views: 65
Reputation: 173
Try adding a parse method after your code and changing your callback to self.parse:
def start_requests(self):
    self.inf = DataInterface()
    df = self.inf.searchData()
    row = df.iloc[2]
    print(row)
    # url = 'http://' + row['Website'].lower()
    # self.rules.append()
    url = 'http://example.com/Page.php?ID=7'
    req = scrapy.http.Request(url=url, callback=self.parse,
                              meta={'index': 1, 'depth': 0,
                                    'firstName': row['First Name'],
                                    'lastName': row['Last Name'],
                                    'found': {}, 'title': row['Title']})
    yield req

def parse(self, response):
    print(response.text)
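CrawlSpider runs its link-extraction rules from its parse method, so the response has to reach parse for any links to be followed at all. As a minimal sketch (reusing the URL from the question), omitting the callback achieves the same routing, since Scrapy then falls back to self.parse:

def start_requests(self):
    # With no explicit callback, Scrapy routes the response to self.parse,
    # which is where CrawlSpider applies the rules.
    yield scrapy.Request(url='http://example.com/Page.php?ID=7',
                         meta={'index': 1, 'depth': 0})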
Upvotes: 1