luka

Reputation: 529

Scraping thousands of URLs

I have a function that scrapes a list of URLs (about 200k of them), and it takes a lot of time. Is there a way to speed up this process?

import requests
from bs4 import BeautifulSoup

def get_odds(ids):
    headers = {
        "Referer": "https://www.betexplorer.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    }
    s = requests.Session()  # one session, so the TCP connection is reused
    matches = []
    for id in ids:
        url = f'https://www.betexplorer.com{id}'
        response = s.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        season = url.split('/')[5]  # fifth path segment of the URL

        # do stuff..

ids is a list of URL paths:

['/soccer/england/premier-league/brentford-norwich/vyLXXRcE/'
...]

Upvotes: 0

Views: 270

Answers (2)

Peter White

Reputation: 338

Yes, you can use multiprocessing.

Something like:

from multiprocessing import Pool

if __name__ == "__main__":
    processes = 10  # the number of concurrent requests (Pool spawns processes, not threads)
    with Pool(processes) as p:
        p.map(get_odds, ids)  # blocks until every id has been processed

Where ids is the list of ids and get_odds is the function you supplied, modified to operate on JUST ONE of the ids. Keep in mind that you will be hitting their server with 10 requests at a time, and this can lead to temporary IP blocking (as you are seen as hostile). You should be mindful of that and adjust the pool size and/or add sleep() logic (see the sketch after the function below).

The get_odds function should look like:

def get_odds(id):
    headers = {
        "Referer": "https://www.betexplorer.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    }
    s = requests.Session()
    matches = []
    url = f'https://www.betexplorer.com{id}'
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    season = url.split('/')[5]

    # do stuff..

    return matches  # so Pool.map can collect each worker's result

Upvotes: 1

Dakopen

Reputation: 82

You could let them run in parallel via multithreading, e.g. by creating 10 threads, where each thread knows from the ending of its ID (0, 1, 2, 3, ...) which IDs it should scrape. This only works with enough computing power and a stable internet connection.

EDIT: Since ids is a list, check the index instead to identify which thread should scrape which website, roughly as sketched below.
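
A rough sketch of that index-based split, assuming the standard threading module and the single-id get_odds from the other answer (the thread count of 10 is just an example):

import threading

def worker(thread_index, ids, num_threads):
    # Each thread handles every num_threads-th id, starting at its own index.
    for id in ids[thread_index::num_threads]:
        get_odds(id)

num_threads = 10
threads = [threading.Thread(target=worker, args=(i, ids, num_threads))
           for i in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all threads to finish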

Upvotes: 0
