Reputation: 942
I have a scraping project that uses Scrapy and runs continuously: when a round finishes, a script starts it again.
A round takes about 10 hours to finish, and each round the memory usage grows by about 100 MB that is never released afterwards.
I used JOBDIR; it helps with the speed, but it doesn't solve the memory issue.
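(For reference, JOBDIR is switched on like this; the directory name below is just a placeholder.)

# settings.py (or -s JOBDIR=... on the scrapy crawl command line):
# persist the scheduler queue and the dupefilter's requests.seen between runs
JOBDIR = 'crawls/property-spider-1'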
Here's what I have got for the spider:
with open(suburbLinkFileName) as f:
    data = json.load(f)
    for link in data:
        all_suburb_links.append(link['result'])

def parse(self, response):
    for suburb_link in self.all_suburb_links:
        absolute_next_suburb_link = 'http://www.xxxx.com.au/buy/' + suburb_link + "?includeSurrounding=false"
        yield Request(url=absolute_next_suburb_link, callback=self.parse_suburb)

def parse_suburb(self, response):
    properties_urls = response.xpath("//*[@class='details-link ']/@href").extract()
    for property_url in properties_urls:
        absolute_property_url = self.base_url + property_url
        yield Request(absolute_property_url, callback=self.parse_property)

    next_page_url = response.xpath('//a[@class="rui-button-brand pagination__link-next"]/@href').extract_first()
    if next_page_url is None:
        return None
    absolute_next_page_url = self.base_url + next_page_url
    yield Request(url=absolute_next_page_url, callback=self.parse_suburb)

def parse_property(self, response):
    if not response.xpath('//title'):
        yield Request(url=response.url, dont_filter=True)
I can't see anything in there that would leak memory. I've already spent a couple of days on this, but no luck.
I just found this: https://docs.scrapy.org/en/latest/topics/leaks.html#leaks-without-leaks
I guess it may be a Python issue rather than a leak in my own code.
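That page suggests watching which Scrapy objects stay alive, either with prefs() in the telnet console or with scrapy.utils.trackref; roughly like this:

# while the crawl is running, connect to the telnet console:
#   telnet localhost 6023
#   >>> prefs()
# or inspect live references from code:
from scrapy.utils.trackref import print_live_refs

print_live_refs()  # prints counts of live Requests/Responses/Items and how old the oldest one is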
Upvotes: 4
Views: 1750
Reputation: 1351
I had a similar issue which I struggled to debug for a very long time. A lot of people report this, so I think there might be a leak somewhere in Scrapy that gets compounded on large crawls, but I was never able to pinpoint it.
Anyway, I solved it by adding garbage collection decorators on a few of my methods. I would first add them on all of your methods and see how much garbage is being collected, and then remove as necessary (this will likely slow down your spider so you'll only want to use them when needed).
As an aside, I would suggest trying a disk-based request queue before adding garbage collection decorators, as this might solve your issue without having to refactor anything. Instructions on how to do that can be found in their docs.
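For reference, the relevant settings look roughly like this; the two queue classes are Scrapy's bundled ones, but treat the exact values as something to double-check against your Scrapy version:

# settings.py: keep the scheduler's pending requests on disk instead of in memory.
# JOBDIR enables persistence; the queue classes below are Scrapy's built-in ones.
JOBDIR = 'crawls/property-spider'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'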
And if that doesn't solve your issue, here is the decorator that I use:
import gc

# garbage collection decorator: force a collection (and report how many
# objects were freed) before the wrapped method runs
def collect_garbage(func):
    def wrapper(*args, **kwargs):
        print(f'\nCollected Garbage before {func.__name__}: {gc.collect()}\n')
        return func(*args, **kwargs)
    return wrapper
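For example, wired into the spider from the question it would look roughly like this (the class name is made up; which callbacks you decorate is up to you):

import scrapy

# collect_garbage is the decorator defined above

class PropertySpider(scrapy.Spider):  # hypothetical name, mirrors the question's spider
    name = 'properties'

    @collect_garbage
    def parse(self, response):
        ...  # the original callback body goes here unchanged

    @collect_garbage
    def parse_suburb(self, response):
        ...

    @collect_garbage
    def parse_property(self, response):
        ...

Since the callbacks are generators, the collection runs when Scrapy calls the method, i.e. just before it starts iterating the yielded requests.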
Hope that helps!
Upvotes: 1
Reputation: 3561
I think this increasing memory usage is a result of RFPDupeFilter, the class Scrapy uses to filter duplicate requests.
By default RFPDupeFilter stores the SHA1 hashes of all URLs visited by your web scraper inside a Python set instance (an unordered collection of unique elements).
This data structure steadily increases memory usage during the scraping process.
In theory each SHA1 hash is 40 hexadecimal digits (20 bytes), but according to my local tests it is stored in this implementation as a str of 40 characters, and sys.getsizeof reports 89 bytes for such a hash (Python 3.6 on Win10 x64).
If you visited 270k URLs, we can estimate that 270,000 × 89 = 24,030,000 bytes (~23 megabytes) are used just to store all the SHA1 hashes.
And that doesn't count the size of the set's hash table (required to quickly filter out non-unique elements).
If you use JOBDIR, Scrapy loads/updates the dupefilter data from the requests.seen file.
I suppose the real memory required to store all this dupefilter data can exceed 100 MB.
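A quick back-of-the-envelope check of those numbers (the exact sizes are CPython- and platform-specific):

import hashlib
import sys

# one request fingerprint as RFPDupeFilter stores it: a 40-character hex string
fingerprint = hashlib.sha1(b'http://www.example.com.au/buy/some-suburb').hexdigest()
print(len(fingerprint))            # 40
print(sys.getsizeof(fingerprint))  # ~89 bytes on CPython 3.6 / x64

# rough lower bound for 270k visited urls (string objects only, set overhead excluded)
print(270000 * 89 / (1024 * 1024))  # ~22.9 MB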
Upvotes: 1
Reputation: 21241
One thing I can recommend is to change
with open(suburbLinkFileName) as f:
    data = json.load(f)
    for link in data:
        all_suburb_links.append(link['result'])

to

with open(suburbLinkFileName) as f:
    data = json.load(f)
    all_suburb_links = tuple(link['result'] for link in data)
Because a tuple takes less memory than a list: a tuple is allocated at its exact size, while a list that is grown with append over-allocates.
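A quick way to compare the two containers' own footprint (this measures only the list/tuple object itself, i.e. its pointer array, not the strings it references):

import sys

# build the collection the way the question does (append in a loop) ...
as_list = []
for i in range(100000):
    as_list.append('suburb-{}'.format(i))

# ... then convert it to a tuple
as_tuple = tuple(as_list)

print(sys.getsizeof(as_list))   # grown by append, so the pointer array is over-allocated
print(sys.getsizeof(as_tuple))  # exact-size pointer array, somewhat smaller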
Upvotes: 0