Reputation: 942
I have a scraping project that uses Scrapy and runs continuously: when a round finishes, a script starts it again.
A round takes about 10 hours to finish, and each round the memory usage grows by about 100 MB that is never released afterwards.
I used JOBDIR; it helps with the speed, but it doesn't solve the memory issue.
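(For reference, JOBDIR is switched on like this; the directory name below is just a placeholder.)

# settings.py (or -s JOBDIR=... on the scrapy crawl command line):
# persist the scheduler queue and the dupefilter's requests.seen between runs
JOBDIR = 'crawls/property-spider-1'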
Here's what I have got for the spider:
with open(suburbLinkFileName) as f:
    data = json.load(f)
    for link in data:
        all_suburb_links.append(link['result'])

def parse(self, response):
    for suburb_link in self.all_suburb_links:
        absolute_next_suburb_link = 'http://www.xxxx.com.au/buy/' + suburb_link + "?includeSurrounding=false"
        yield Request(url=absolute_next_suburb_link, callback=self.parse_suburb)

def parse_suburb(self, response):
    properties_urls = response.xpath("//*[@class='details-link ']/@href").extract()
    for property_url in properties_urls:
        absolute_property_url = self.base_url + property_url
        yield Request(absolute_property_url, callback=self.parse_property)

    next_page_url = response.xpath('//a[@class="rui-button-brand pagination__link-next"]/@href').extract_first()
    if next_page_url is None:
        return None
    absolute_next_page_url = self.base_url + next_page_url
    yield Request(url=absolute_next_page_url, callback=self.parse_suburb)

def parse_property(self, response):
    if not response.xpath('//title'):
        yield Request(url=response.url, dont_filter=True)
I can't see anything in there that would leak memory. I've already spent a couple of days on this, but no luck.
I just found this: https://docs.scrapy.org/en/latest/topics/leaks.html#leaks-without-leaks
I guess it may be a Python issue rather than a leak in my own code.
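That page suggests watching which Scrapy objects stay alive, either with prefs() in the telnet console or with scrapy.utils.trackref; roughly like this:

# while the crawl is running, connect to the telnet console:
#   telnet localhost 6023
#   >>> prefs()
# or inspect live references from code:
from scrapy.utils.trackref import print_live_refs

print_live_refs()  # prints counts of live Requests/Responses/Items and how old the oldest one is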
Upvotes: 4
Views: 1750
Reputation: 1351
I had a similar issue which I struggled to debug for a very long time. A lot of people report this, so I think there might be a leak somewhere in Scrapy that gets compounded on large crawls, but I was never able to pinpoint it.
Anyway, I solved it by adding garbage collection decorators on a few of my methods. I would first add them on all of your methods and see how much garbage is being collected, and then remove as necessary (this will likely slow down your spider so you'll only want to use them when needed).
As an aside, I would suggest trying a disk-based request queue before adding garbage collection decorators, as this might solve your issue without having to refactor anything. Instructions on how to do that can be found in their docs.
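For reference, the relevant settings look roughly like this; the two queue classes are Scrapy's bundled ones, but treat the exact values as something to double-check against your Scrapy version:

# settings.py: keep the scheduler's pending requests on disk instead of in memory.
# JOBDIR enables persistence; the queue classes below are Scrapy's built-in ones.
JOBDIR = 'crawls/property-spider'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'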
And if that doesn't solve your issue, here is the decorator that I use:
import gc

# garbage collection decorator: force a collection (and report how many
# objects were freed) before the wrapped method runs
def collect_garbage(func):
    def wrapper(*args, **kwargs):
        print(f'\nCollected Garbage before {func.__name__}: {gc.collect()}\n')
        return func(*args, **kwargs)
    return wrapper
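For example, wired into the spider from the question it would look roughly like this (the class name is made up; which callbacks you decorate is up to you):

import scrapy

# collect_garbage is the decorator defined above

class PropertySpider(scrapy.Spider):  # hypothetical name, mirrors the question's spider
    name = 'properties'

    @collect_garbage
    def parse(self, response):
        ...  # the original callback body goes here unchanged

    @collect_garbage
    def parse_suburb(self, response):
        ...

    @collect_garbage
    def parse_property(self, response):
        ...

Since the callbacks are generators, the collection runs when Scrapy calls the method, i.e. just before it starts iterating the yielded requests.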
Hope that helps!
Upvotes: 1
Reputation: 3561
I think this increasing memory usage is a result of RFPDupeFilter, the class Scrapy uses to filter duplicate requests.
By default RFPDupeFilter stores the SHA1 hashes of all URLs visited by your web scraper inside a Python set instance (an unordered collection of unique elements).
This data structure steadily increases memory usage during the scraping process.
In theory each SHA1 hash is 40 hexadecimal digits (20 bytes), but according to my local tests it is stored in this implementation as a str of 40 characters, and sys.getsizeof reports 89 bytes for such a hash (Python 3.6 on Win10 x64).
If you visited 270k URLs, we can estimate that 270,000 × 89 = 24,030,000 bytes (~23 megabytes) are used just to store all the SHA1 hashes.
And that doesn't count the size of the set's hash table (required to quickly filter out non-unique elements).
If you use JOBDIR, Scrapy loads/updates the dupefilter data from the requests.seen file.
I suppose the real memory required to store all this dupefilter data can exceed 100 MB.
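A quick back-of-the-envelope check of those numbers (the exact sizes are CPython- and platform-specific):

import hashlib
import sys

# one request fingerprint as RFPDupeFilter stores it: a 40-character hex string
fingerprint = hashlib.sha1(b'http://www.example.com.au/buy/some-suburb').hexdigest()
print(len(fingerprint))            # 40
print(sys.getsizeof(fingerprint))  # ~89 bytes on CPython 3.6 / x64

# rough lower bound for 270k visited urls (string objects only, set overhead excluded)
print(270000 * 89 / (1024 * 1024))  # ~22.9 MB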
Upvotes: 1
Reputation: 21241
One thing I can recommend is to change
with open(suburbLinkFileName) as f:
    data = json.load(f)
    for link in data:
        all_suburb_links.append(link['result'])

to

with open(suburbLinkFileName) as f:
    data = json.load(f)
    all_suburb_links = tuple(link['result'] for link in data)
Because a tuple takes less memory than a list: a tuple is allocated at its exact size, while a list that is grown with append over-allocates.
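A quick way to compare the two containers' own footprint (this measures only the list/tuple object itself, i.e. its pointer array, not the strings it references):

import sys

# build the collection the way the question does (append in a loop) ...
as_list = []
for i in range(100000):
    as_list.append('suburb-{}'.format(i))

# ... then convert it to a tuple
as_tuple = tuple(as_list)

print(sys.getsizeof(as_list))   # grown by append, so the pointer array is over-allocated
print(sys.getsizeof(as_tuple))  # exact-size pointer array, somewhat smaller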
Upvotes: 0