CottonCandy

Reputation: 502

Request getting stuck when scraping?

I am a freshman at CMU who is completely lost in his first term project, and I would really appreciate your help :)

I am writing a scraping tool, and sometimes a request simply never responds. It doesn't return anything, not even an error. As a result, my scraper gets stuck on one URL instead of recognizing that it is stuck and moving on. Here is the code:

import time

import requests
from bs4 import BeautifulSoup

def extractHTML(url):
    startTime = time.time()
    headers = requests.utils.default_headers()
    headers.update(
        {'User-Agent':
         'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'})
    paper1Link = requests.get(url, headers=headers)
    paper1Content = BeautifulSoup(paper1Link.content, "lxml")
    return str(paper1Content)
  1. How do I make Python recognize that the request is stuck and give up after a few seconds?
  2. The site http://www.apa.org/ won't grant me access even when I change the headers to make the request look like one from a regular browser. How can I get a response from it?

Upvotes: 1

Views: 178

Answers (1)

aghast

Reputation: 15310

The requests documentation has a section called "Timeouts". Perhaps you should start there.

paper1Link = requests.get(url,headers=headers, timeout=0.4)
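On top of passing `timeout`, you will want to catch the exception it raises, or the scraper will crash on the first slow URL instead of moving on. Here is a minimal sketch of the asker's function with both pieces; `fetch_html` and the 5-second default are illustrative choices, not part of the original code:

```python
import requests

def fetch_html(url, timeout=5):
    """Fetch a page, returning its HTML text or None if the request fails."""
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
               'rv:52.0) Gecko/20100101 Firefox/52.0'}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures too
    except requests.exceptions.RequestException:
        # Covers Timeout, ConnectionError and HTTPError: skip this URL.
        return None
    return response.text
```

Note that `timeout` bounds the connect wait and the wait between bytes read, not the total download time, so a server that trickles data slowly can still take longer overall. Also, 0.4 seconds is quite aggressive for real sites; a few seconds is a more forgiving starting point.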

Upvotes: 2
