Reputation: 3299
The objective is to scrape multiple pages with BeautifulSoup, where the HTML is fetched with requests.get.
The steps are:
First, load the HTML using requests:
page = requests.get('https://oatd.org/oatd/' + url_to_pass)
Then, scrape the HTML content using the function below:
def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)
Say we have a hundred unique URLs to scrape, ['record?record=handle\:11012\%2F16478&q=eeg'] * 100; the whole process can be completed with the code below:
import requests
from bs4 import BeautifulSoup as Soup

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # In practice there will be 100 different unique sub-hrefs; for illustration purposes we deliberately duplicate the URL

all_website_scrape = []
for url_to_pass in list_of_url:
    page = requests.get('https://oatd.org/oatd/' + url_to_pass)
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))
However, each URL is requested and scraped one at a time, which makes the whole process slow.
Is there another way to improve the performance of the above code that I am not aware of?
Upvotes: 0
Views: 222
Reputation: 96
realpython.com has a nice article about speeding up Python scripts with concurrency.
https://realpython.com/python-concurrency/
Using their threading example, you can set the number of worker threads, which increases the number of requests you can make at once.
from bs4 import BeautifulSoup as Soup
import concurrent.futures
import requests
import threading
import time

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

def get_session():
    # Give each thread its own requests.Session so connections are reused safely
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(url_to_pass):
    session = get_session()
    page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
    print(f"{page.status_code}: {page.reason}")
    if page.status_code == 200:
        all_website_scrape.append(get_each_page(Soup(page.text, 'html.parser')))

def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_site, sites)

if __name__ == "__main__":
    list_of_url = ['record?record=handle\:11012\%2F16478&q=eeg'] * 100  # In practice there will be 100 different unique sub-hrefs; for illustration purposes we deliberately duplicate the URL

    all_website_scrape = []
    thread_local = threading.local()

    start_time = time.time()
    download_all_sites(list_of_url)
    duration = time.time() - start_time

    print(f"Downloaded {len(all_website_scrape)} in {duration} seconds")
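A small variation on the same pattern, in case mutating the shared all_website_scrape list from several threads feels fragile: have the worker return its dict and collect the results from executor.map. This is only a minimal sketch that reuses the get_each_page and get_session helpers defined above:

import concurrent.futures
from bs4 import BeautifulSoup as Soup

def download_site(url_to_pass):
    # Same request as above, but return the scraped dict instead of appending to a global list
    session = get_session()
    page = session.get('https://oatd.org/oatd/' + url_to_pass, timeout=10)
    if page.status_code == 200:
        return get_each_page(Soup(page.text, 'html.parser'))
    return None  # failed requests are filtered out below

def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(download_site, sites))
    # Keep only the successfully scraped pages
    return [r for r in results if r is not None]

# Usage: all_website_scrape = download_all_sites(list_of_url)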
Upvotes: 1
Reputation: 910
You could use the threading module to make the script multithreaded, which would make it much faster. https://docs.python.org/3/library/threading.html
But if you are willing to change your approach, I'd recommend Scrapy.
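For reference, here is a minimal sketch of what a Scrapy spider for the same extraction could look like, assuming the page uses the same itemprop="name" markup the question relies on; the spider name and the single start URL (reused from the question's example) are purely illustrative:

import scrapy

class OatdSpider(scrapy.Spider):
    name = "oatd"  # illustrative spider name
    # Reuses the question's example record path; in practice start_urls would hold all 100 records
    start_urls = ['https://oatd.org/oatd/record?record=handle\:11012\%2F16478&q=eeg']

    def parse(self, response):
        # CSS selector mirrors the BeautifulSoup lookup on itemprop="name"
        yield {
            'paper_author': response.css('[itemprop="name"]::text').get(),
            'paper_title': response.css('[itemprop="name"]::text').get(),
        }

Saved as, say, oatd_spider.py, it can be run with scrapy runspider oatd_spider.py -o results.json. Scrapy issues its requests concurrently by default, so many records are fetched in parallel without any explicit threading code.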
Upvotes: 0