Reputation: 584
I'm new to python language and I started working with Scrapy which is amazing, and I would like to know if there is any way to write less and avoid lots of if statements also avoid repeaters in my code :
titleText = response.css('title::text').get()
if response.status == 200:
if len(response.css('title').extract()) == 0:
self.data['crawl']['meta']['missingTitle'].append({
'url': response.url,
'latency': response.meta.get('download_latency')
})
elif len(response.css('title').extract()) == 1:
if len(titleText) > 70:
self.data['crawl']['meta']['longTitle'].append({
'url': response.url,
'text': titleText,
'latency': response.meta.get('download_latency')
})
elif len(titleText) < 50:
self.data['crawl']['meta']['shortTitle'].append({
'url': response.url,
'text': titleText,
'latency': response.meta.get('download_latency')
})
else:
self.data['crawl']['meta']['multipleTitleTag'].append({
'url': response.url,
'text': titleText,
'latency': response.meta.get('download_latency')
})
elif response.status in self.handle_httpstatus_list:
self.data['crawl']['blockers']['url'].append({
'url': response.url,
'code':response.status,
'text': titleText,
'latency': response.meta.get('download_latency')
})
Thank you
Upvotes: 0
Views: 59
Reputation: 584
titleText = response.css('title::text').get()
metaResponse = {'text':titleText, 'url':response.url, 'httpCode':response.status, 'latency':response.meta.get('download_latency')}
if response.status == 200:
if len(response.css('title').extract()) == 0:
self.data['crawl']['meta']['missingTitle'].append(metaResponse)
elif len(response.css('title').extract()) == 1:
if len(titleText) > 70:
self.data['crawl']['meta']['longTitle'].append(metaResponse)
elif len(titleText) < 40:
self.data['crawl']['meta']['shortTitle'].append(metaResponse)
else:
self.data['crawl']['meta']['multipleTitleTag'].append(metaResponse)
elif response.status in self.handle_httpstatus_list:
self.data['crawl']['blockers']['url'].append(metaResponse)
Upvotes: 0
Reputation: 2776
You have a lot of repeating code, but all of that is hard to remove if you have such different possibilities.
titleText = response.css('title::text').get()
dataObj = {
'url': response.url,
'latency': response.meta.get('download_latency')
}
title = response.css('title').extract()
tagName = 'missingTitle' #default
if response.status == 200:
if len(title) == 0:
tagName = 'missingTitle'
else:
dataObj['text'] = titleText
if len(title) > 1:
tagName = 'multipleTitleTag'
elif len(title) ==1:
if len(titleText) > 70:
tagName = 'longTitle'
elif len(titleText) < 50:
tagName = 'shortTitle'
else:
#Really nothing done if title between 50 and 70?
pass
self.data['crawl']['meta'][tagName].append(dataObj)
elif response.status in self.handle_httpstatus_list:
dataObj['code'] = response.status
dataObj['text'] = titleText
self.data['crawl']['meta']['multipleTitleTag'].append(dataObj)
The title tag functionality could be moved to a helper function to make things a bit cleaner. There the tag name choice can be well structured with early returns. Altogether it could look sth like this (adjust details accordingly):
def get_tag_name(self, title, titleText):
if len(title) == 0:
return 'missingTitle'
if len(title) > 1:
return 'multipleTitleTag'
if len(titleText) > 70:
return 'longTitle'
if len(titleText) < 50:
return 'longTitle'
return 'DEFAULT' #???
def the_main_one(self, response):
titleText = response.css('title::text').get()
dataObj = {
'url': response.url,
'latency': response.meta.get('download_latency')
}
tagName = 'multipleTitleTag' #default
title = response.css('title').extract()
if response.status == 200:
tagName = self.get_tag_name(title, titleText)
if len(title) == 0:
dataObj['text'] = titleText
elif response.status in self.handle_httpstatus_list:
dataObj['code'] = response.status
dataObj['text'] = titleText
self.data['crawl']['meta'][tagName].append(dataObj)
It might not be a lot shorter, but it definitely removes some clutter and helps in maintaining of the code later
If you had a lot of IFs in the same level, you could think about different approach. Now this seems about right (at least for me).
Upvotes: 1