Reputation: 1000
I have collected links to wanted persons from the Interpol website. There are about 10k links. Scraping them one by one takes hours, so I am looking for a way to do it asynchronously with grequests.
This is the preview of my links list:
final_links[:20]
['https://www.interpol.int/notice/search/wanted/2009-19572',
'https://www.interpol.int/notice/search/wanted/2015-74196',
'https://www.interpol.int/notice/search/wanted/2014-37667',
'https://www.interpol.int/notice/search/wanted/2011-30019',
'https://www.interpol.int/notice/search/wanted/2009-34171',
'https://www.interpol.int/notice/search/wanted/2012-334072',
'https://www.interpol.int/notice/search/wanted/2012-334068',
'https://www.interpol.int/notice/search/wanted/2012-334070',
'https://www.interpol.int/notice/search/wanted/2013-26064',
'https://www.interpol.int/notice/search/wanted/2013-2528',
'https://www.interpol.int/notice/search/wanted/2014-32597',
'https://www.interpol.int/notice/search/wanted/2013-23413',
'https://www.interpol.int/notice/search/wanted/2010-42146',
'https://www.interpol.int/notice/search/wanted/2015-30555',
'https://www.interpol.int/notice/search/wanted/2013-2514',
'https://www.interpol.int/notice/search/wanted/2010-53288',
'https://www.interpol.int/notice/search/wanted/2015-58805',
'https://www.interpol.int/notice/search/wanted/2015-58807',
'https://www.interpol.int/notice/search/wanted/2015-58803',
'https://www.interpol.int/notice/search/wanted/2015-62307']
For now I am just trying to obtain a response from each link:
unsent_request = (grequests.get(url) for url in final_links)
results = grequests.map(unsent_request)
The first couple of results are 200 responses, but then most of them (though not all) are 403. Is it just the Interpol server not allowing this, or am I doing something wrong (am I too greedy? :) )? When I go one by one with requests, it works fine.
Upvotes: 1
Views: 323
Reputation: 396
This is most likely their website's protection at work. Sending thousands of requests at once is essentially spamming the server programmatically, so it lets a few requests through and then answers 403 Forbidden for the rest. You could check the status code of each response and, on a 403, sleep briefly and retry the request, increasing the sleep each time until the requests succeed again. Alternatively, you could route the requests over Tor and switch to a new circuit whenever you receive a 403, which gives you a new exit node.
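The sleep-and-retry idea could look roughly like the sketch below. It is only an illustration, not tested against the Interpol site: the function name fetch_with_backoff and the parameters get, max_retries, base_delay and sleep are made up for this example, and the doubling delay is one possible way of "increasing the sleep each time".

```python
import time


def fetch_with_backoff(url, get, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Fetch url with get(url); on a 403, wait and retry with a growing delay.

    `get` is any callable returning an object with a .status_code attribute
    (for example requests.get). All names here are hypothetical.
    """
    delay = base_delay
    response = get(url)
    for _ in range(max_retries):
        if response.status_code != 403:
            break  # anything other than 403 is returned to the caller as-is
        sleep(delay)   # back off before trying again
        delay *= 2     # double the wait after each consecutive 403
        response = get(url)
    return response
```

With the requests library this would be called as fetch_with_backoff(url, requests.get). If you stay with grequests, passing a pool size such as grequests.map(unsent_request, size=5) limits how many requests run concurrently, which may be enough to keep the server from rejecting you in the first place; the value 5 is a guess.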
Upvotes: 2