Reputation: 46
I have a web scraping app hosted on Heroku that I use to scrape about 40 company web pages. 27 of them almost always give me 403 errors on Heroku, but every page works fine if I run the code locally.
After about 25 minutes of running the app and getting 403 errors (the timeframe varies a lot), all of the pages magically start working, but will return 403s again if the app restarts.
How can I prevent these 403 errors from happening at all? The relevant code is as follows:
from bs4 import BeautifulSoup as soup
import urllib.request as ureq
from urllib.error import HTTPError
import time

def scraper(url):
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'
    ufile0 = ureq.Request(url, headers={'User-Agent': user_agent,
                                        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                                        'Referer': 'https://www.google.com/'})
    try:
        ufile1 = ureq.urlopen(ufile0)
    except HTTPError as err:
        if err.code == 504:
            print('504, need to drop it right now')
            return
        elif err.code == 403:
            print('403ed oof')
            return
        else:
            print('unknown http error')
            raise
    text = ufile1.read()
    ufile1.close()
    psoup = soup(text, "html.parser")

while 1:
    url = 'http://ir.nektar.com/press-releases?page=0'
    scraper(url)
    time.sleep(7)
Upvotes: 2
Views: 861
Reputation: 67
You may need to run this behind a proxy. Fixie is available as a Heroku add-on.
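As a minimal sketch, the question's urllib request could be routed through such a proxy like this. It assumes the add-on exposes its proxy URL (including credentials) in a FIXIE_URL config var; check the add-on docs for the exact variable name.

import os
import urllib.request as ureq

# Assumption: the Fixie add-on sets FIXIE_URL to something like
# http://user:password@host:port
proxy_url = os.environ.get('FIXIE_URL', '')

def open_via_proxy(request):
    # Send both http and https traffic through the proxy
    proxy_handler = ureq.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    opener = ureq.build_opener(proxy_handler)
    return opener.open(request)

Requests opened this way leave Heroku through the proxy's fixed outbound IPs instead of the shared dyno address pool, which is the point of the add-on.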
Upvotes: 0
Reputation: 37
I had a similar problem. My Django web application worked fine locally, but after deploying on Heroku it returned nothing. I fixed it by using a background worker.
I found this in the Heroku documentation: One cause of request timeouts is an infinite loop in the code. Test locally (perhaps using a local copy of the production database, extracted using pgbackups) and see if you can replicate the problem and fix the bug.
Another possibility is that you are trying to do some sort of long-running task inside your web process, such as:
- Sending an email
- Accessing a remote API (posting to Twitter, querying Flickr, etc.)
- Web scraping / crawling
- Rendering an image or PDF
- Heavy computation (computing a fibonacci sequence, etc.)
- Heavy database usage (slow or numerous queries, N+1 queries)
If so, you should move this heavy lifting into a background job which can run asynchronously from your web request. See Worker Dynos, Background Jobs and Queueing for details.
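As a rough sketch of that last suggestion, the scraping loop from the question could be moved into its own script and run as a worker dyno, declared with a line like "worker: python worker.py" in the Procfile. The file name and import path below are hypothetical.

# worker.py - standalone process for the scraping loop, run as a
# Heroku worker dyno so it never blocks a web request.
import time

from myapp.scraping import scraper  # assumes scraper() from the question was moved here

def main():
    url = 'http://ir.nektar.com/press-releases?page=0'
    while True:
        scraper(url)   # heavy lifting happens off the web dyno
        time.sleep(7)  # same 7-second pause as in the question

if __name__ == '__main__':
    main()

The web dyno then only serves requests and reads whatever the worker has already stored, so scraping time never counts against Heroku's 30-second request timeout.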
Upvotes: 1