Reputation: 1903
These days I'm building a spider with Scrapy in Python. It's basically a simple spider class that parses a few fields of an HTML page. I don't use the start_urls Scrapy attribute, but a custom list like this:
class start_urls_mod():
    def __init__(self, url, data):
        self.url = url
        self.data = data

# Defined in the spider class:
url_to_scrape = []

# Populated in the spider body in this way:
self.url_to_scrape.append(start_urls_mod(url_found, str(data_found)))
and I pass the URLs this way:
for any_url in self.url_to_scrape:
    yield scrapy.Request(any_url.url, callback=self.parse_page)
It works fine with a limited number of URLs, around 3000. But in a test where it found about 32532 URLs to scrape, the JSON output file contained only about 3000 scraped URLs.
My function calls itself recursively:
yield scrapy.Request(any_url.url, callback=self.parse_page)
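For reference, here is a minimal sketch of how these pieces could fit together in one spider; the spider name, the start URL, and the selectors below are placeholders rather than the original code:

import scrapy


class start_urls_mod():
    # simple holder for a URL plus the data found next to it
    def __init__(self, url, data):
        self.url = url
        self.data = data


class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder name

    # custom list used instead of start_urls
    url_to_scrape = []

    def start_requests(self):
        # placeholder entry point
        yield scrapy.Request("http://example.com/", callback=self.parse_page)

    def parse_page(self, response):
        # parse some fields of the page (placeholder selectors)
        yield {"url": response.url,
               "title": response.css("title::text").extract_first()}

        # collect newly found URLs and data into the custom list ...
        for href in response.css("a::attr(href)").extract():
            self.url_to_scrape.append(
                start_urls_mod(response.urljoin(href), str(href)))

        # ... and recurse into them with the same callback
        for any_url in self.url_to_scrape:
            yield scrapy.Request(any_url.url, callback=self.parse_page)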
So the question is: is there some memory limit for Scrapy items?
Upvotes: 0
Views: 4547
Reputation: 18799
No, not unless you have specified CLOSESPIDER_ITEMCOUNT in your settings.
Maybe Scrapy is finding duplicates in your requests; please check whether the stats in your logs contain something like dupefilter/filtered.
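For illustration, a short sketch of both checks; the value 3000 and the dont_filter usage are examples, not a recommendation for this particular spider:

# settings.py -- the CloseSpider extension only limits the item count
# if this is set explicitly; its default is 0, meaning no limit:
# CLOSESPIDER_ITEMCOUNT = 3000

# In the spider: requests for an already seen URL are silently dropped
# by the default dupefilter. If revisiting the same URL is intentional,
# the filter can be bypassed per request:
yield scrapy.Request(any_url.url, callback=self.parse_page, dont_filter=True)

At the end of the crawl Scrapy dumps its stats; a dupefilter/filtered counter there tells you how many requests were discarded as duplicates.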
Upvotes: 2