Reputation: 495
I am trying to unshort a lot of URLs which I have in a urlSet. The following code works most of the time. But some times it takes a very long time to finish. For example I have 2950 in urlSet. stderr tells me that 2900 is done, but getUrlMapping does not finish.
def getUrlMapping(urlSet):
# get the url mapping
urlMapping = {}
#rs = (grequests.get(u) for u in urlSet)
rs = (grequests.head(u) for u in urlSet)
res = grequests.imap(rs, size = 100)
counter = 0
for x in res:
counter += 1
if counter % 50 == 0:
sys.stderr.write('Doing %d url_mapping length %d \n' %(counter, len(urlMapping)))
urlMapping[ getOriginalUrl(x) ] = getGoalUrl(x)
return urlMapping
def getGoalUrl(resp):
url=''
try:
url = resp.url
except:
url = 'NULL'
return url
def getOriginalUrl(resp):
url=''
try:
url = resp.history[0].url
except IndexError:
url = resp.url
except:
url = 'NULL'
return url
Upvotes: 3
Views: 2163
Reputation: 3301
Probably it won't help you as it has passed a long time but still..
I was having some issues with Requests, similar to the ones you are having. To me the problem was that Requests took ages to download some pages, but using any other software (browsers, curl, wget, python's urllib) everything worked fine...
Afer a LOT of time wasted, I noticed that the server was sending some invalid headers, for example, in one of the "slow" pages, after Content-type: text/html
it began to send header in the form Header-name : header-value
(notice the space before the colon). This somehow breaks Python's email.header
functionality used to parse HTTP headers by Requests so the Transfer-encoding: chunked
header wasn't being parsed.
Long story short: manually setting the chunked
property to True
of Response objects before asking for the content solved the issue. For example:
response = requests.get('http://my-slow-url')
print(response.text)
took ages but
response = requests.get('http://my-slow-url')
response.raw.chunked = True
print(response.text)
worked great!
Upvotes: 1