gstar2002
gstar2002

Reputation: 495

Python grequests takes a long time to finish

I am trying to unshort a lot of URLs which I have in a urlSet. The following code works most of the time. But some times it takes a very long time to finish. For example I have 2950 in urlSet. stderr tells me that 2900 is done, but getUrlMapping does not finish.

def getUrlMapping(urlSet):
# get the url mapping
urlMapping = {}
#rs = (grequests.get(u) for u in urlSet)
rs = (grequests.head(u) for u in urlSet)
res = grequests.imap(rs, size = 100)
counter = 0
for x in res:
    counter += 1
    if counter % 50 == 0:
        sys.stderr.write('Doing %d url_mapping length %d \n' %(counter, len(urlMapping)))
    urlMapping[ getOriginalUrl(x) ]  =   getGoalUrl(x) 
return urlMapping

def getGoalUrl(resp):
url=''
try:
    url = resp.url
except:
    url = 'NULL'
return url

def getOriginalUrl(resp):
url=''
try:
    url = resp.history[0].url
except IndexError:
    url = resp.url
except:
    url = 'NULL'
return url

Upvotes: 3

Views: 2163

Answers (1)

llekn
llekn

Reputation: 3301

Probably it won't help you as it has passed a long time but still..

I was having some issues with Requests, similar to the ones you are having. To me the problem was that Requests took ages to download some pages, but using any other software (browsers, curl, wget, python's urllib) everything worked fine...

Afer a LOT of time wasted, I noticed that the server was sending some invalid headers, for example, in one of the "slow" pages, after Content-type: text/html it began to send header in the form Header-name : header-value (notice the space before the colon). This somehow breaks Python's email.header functionality used to parse HTTP headers by Requests so the Transfer-encoding: chunked header wasn't being parsed.

Long story short: manually setting the chunked property to True of Response objects before asking for the content solved the issue. For example:

response = requests.get('http://my-slow-url')
print(response.text)

took ages but

response = requests.get('http://my-slow-url')
response.raw.chunked = True
print(response.text)

worked great!

Upvotes: 1

Related Questions