Reputation: 1351
I'm building a spider / scraper with Scrapy and was wondering which would be more efficient: initializing an API wrapper object once as a class attribute, or reinitializing it with each URL request? I'm asking in the context of overall efficiency and memory (leaks), as this will be a fairly large project (millions of requests).
Case 1:
# init API wrapper ONCE as class attribute
class ScrapySpider():
    api = SomeAPIWrapper()

    urls = [
        'https://website.com',
        # ... +1mil URLs
    ]

    def request(self):
        for url in self.urls:
            yield Request(url)

    def parse(self, response):
        yield self.api.get_meta(response.url)
Case 2:
# init new API wrapper on EACH request
class ScrapySpider():
    urls = [
        'https://website.com',
        # ... +1mil URLs
    ]

    def request(self):
        for url in self.urls:
            yield Request(url)

    def parse(self, response):
        api = SomeAPIWrapper()
        yield api.get_meta(response.url)
Upvotes: 0
Views: 75
Reputation: 77892
There's no generic, one-size-fits-all answer to this question - it depends on how costly the object's instantiation is, how often you end up instantiating it in the best / average / worst case, and, since your example uses a class attribute (instead of an instance attribute), whether it's safe to share this object amongst all instances of the host class.
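To illustrate the sharing point (a minimal sketch, with DummyWrapper standing in for the real API wrapper): a class attribute is evaluated once when the class body runs, and every instance then sees the same object, so any internal state in the wrapper is shared.

class DummyWrapper:
    pass

class SpiderWithClassAttr:
    # created once, when the class body is executed
    api = DummyWrapper()

a = SpiderWithClassAttr()
b = SpiderWithClassAttr()
print(a.api is b.api)  # True - both instances share the same wrapper object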
Note that there are two other alternatives:
1/ a per-instance attribute created in the initializer:
class ScrapySpider():
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        self.api = SomeAPIWrapper()
which avoids the concurrent access issues you might get with a class attribute, and
2/ a cached property
class ScrapySpider():
    @property
    def api(self):
        if not hasattr(self, "_cached_api"):
            self._cached_api = ApiWrapper()
        return self._cached_api
which also prevents creating the ApiWrapper instance before it's needed (might be useful if creating it is costly and it's not always needed) but adds a small overhead on attribute access.
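On Python 3.8+ the same pattern can be written more concisely with functools.cached_property, which computes the value on first access and stores it on the instance (a minimal sketch, with ApiWrapper stubbed out as a placeholder):

from functools import cached_property

class ApiWrapper:                    # stand-in for the real wrapper
    def get_meta(self, url):
        return {"url": url}

class ScrapySpider:
    @cached_property
    def api(self):
        # created on first access, then reused for this instance
        return ApiWrapper()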
Upvotes: 1
Reputation: 3857
In the example code, using a class attribute (Case 1) should be more efficient.
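One way to get a feel for the difference is a rough micro-benchmark (a sketch only, with DummyWrapper standing in for the real API wrapper; the actual gap depends entirely on how heavy the wrapper's __init__ is):

import timeit

class DummyWrapper:
    def __init__(self):
        self.session = {}            # imagine real setup work here

    def get_meta(self, url):
        return {"url": url}

shared = DummyWrapper()

def reuse_shared(url="https://website.com"):
    return shared.get_meta(url)      # Case 1: one shared instance

def create_each_time(url="https://website.com"):
    return DummyWrapper().get_meta(url)  # Case 2: new instance per call

print(timeit.timeit(reuse_shared, number=1_000_000))
print(timeit.timeit(create_each_time, number=1_000_000))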
Upvotes: 2