rpb

Reputation: 3299

How to accelerate web scraping using the combination of requests and BeautifulSoup in Python?

The objective is to scrape multiple pages with BeautifulSoup, where the HTML input comes from requests.get.

The steps are:

First, load the HTML using requests:

page = requests.get('https://oatd.org/oatd/' + url_to_pass)

Then, scrape the HTML content using the function below:

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

Say we have a hundred unique URLs to be scraped, ['record?record=handle\:11012\%2F16478&q=eeg'] * 100; the whole process can be completed via the code below:

import requests
from bs4 import BeautifulSoup as Soup

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100 # In practice, there will be 100 different unique sub-hrefs; for illustration purposes, we deliberately duplicate the same URL
all_website_scrape = []
for url_to_pass in list_of_url:
    page = requests.get('https://oatd.org/oatd/' + url_to_pass)
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

However, each URL is requested and scraped one at a time, so in principle the process is time-consuming.

I wonder if there is another way to improve the performance of the above code that I am not aware of?

Upvotes: 0

Views: 222

Answers (2)

mihuo999o

Reputation: 96

realpython.com has a nice article about speeding up Python scripts with concurrency.

https://realpython.com/python-concurrency/

Using their threading example, you can set the number of workers to run multiple threads, which increases the number of requests you can make at once.

    from bs4 import BeautifulSoup as Soup
    import concurrent.futures
    import requests
    import threading
    import time
    
    def get_each_page(page_soup):
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)
    
    def get_session():
        if not hasattr(thread_local, "session"):
            thread_local.session = requests.Session()
        return thread_local.session
    
    def download_site(url_to_pass):
        session = get_session()
        page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
        print(f"{page.status_code}: {page.reason}")
        if page.status_code == 200:
            all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))
    
    def download_all_sites(sites):
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(download_site, sites)
    
    if __name__ == "__main__":
        list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # In practice, there will be 100 different unique sub-hrefs; for illustration purposes, we deliberately duplicate the same URL
        all_website_scrape = []
        thread_local = threading.local()
        start_time = time.time()
        download_all_sites(list_of_url)
        duration = time.time() - start_time
        print(f"Downloaded {len(all_website_scrape)} in {duration} seconds")

Upvotes: 1

UWTD TV

Reputation: 910

You can use the threading module to make the script multi-threaded and go much faster. https://docs.python.org/3/library/threading.html
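
For illustration, here is a minimal sketch (assumed, not taken from the answer) of how the question's loop could be spread across threads with threading.Thread; get_each_page and the URL list come from the question, and starting one thread per URL is only reasonable for a small list:

    import threading
    import requests
    from bs4 import BeautifulSoup as Soup

    def get_each_page(page_soup):
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

    list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100
    all_website_scrape = []
    lock = threading.Lock()

    def worker(url_to_pass):
        page = requests.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
        if page.status_code == 200:
            record = get_each_page(Soup(page.text, 'html.parser'))
            with lock:  # protect the shared list across threads
                all_website_scrape.append(record)

    threads = [threading.Thread(target=worker, args=(url,)) for url in list_of_url]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(f"Scraped {len(all_website_scrape)} pages")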

But if you are willing to change your approach, I would recommend Scrapy.
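
For reference, a minimal, untested sketch of what a Scrapy spider for the same kind of page might look like; the spider name and output file are made up, the start URL is the question's example with the stray backslashes dropped, and the CSS selector mirrors the itemprop="name" lookup from the question:

    import scrapy

    class OatdSpider(scrapy.Spider):
        # hypothetical spider name; Scrapy schedules requests concurrently
        # (CONCURRENT_REQUESTS defaults to 16)
        name = "oatd"
        start_urls = ['https://oatd.org/oatd/record?record=handle:11012%2F16478&q=eeg']

        def parse(self, response):
            yield {
                'paper_author': response.css('[itemprop="name"]::text').get(),
                'paper_title': response.css('[itemprop="name"]::text').get(),
            }

You would run it with something like scrapy runspider oatd_spider.py -o results.json to crawl and dump the items to JSON.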

Upvotes: 0
