Hicham Morocco

Reputation: 21

Make python requests and BeautifulSoup faster

Hi, I want to get some data from a website (users and each user's data) and then save that data in a SQLite database.

I use lxml and cchardet, but that doesn't make a noticeable difference.
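
For reference, lxml only speeds things up if it is actually selected as the parser, while cchardet is picked up automatically by BeautifulSoup's encoding detection once it is installed. A minimal sketch (html_text is a placeholder for a page fetched earlier):

from bs4 import BeautifulSoup

# "lxml" selects the fast C parser; cchardet, if installed, is used
# automatically to speed up character-encoding detection
soup = BeautifulSoup(html_text, "lxml")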

Upvotes: 2

Views: 187

Answers (1)

ahmedshahriar

Reputation: 1076

If you want to reduce the time spent in the scraping part, I suggest building a multithreaded program to make the requests. concurrent.futures is one of the easiest ways to multithread these kinds of requests, in particular with ThreadPoolExecutor. The documentation even includes a simple multithreaded URL-request example.

import concurrent.futures as futures
import requests
from requests.exceptions import RequestException

urllist = ...  # list of URLs to scrape

def get_url_data(url, session):
    """Fetch a single URL, returning its HTML or None on failure."""
    try:
        r = session.get(url, timeout=10)
        r.raise_for_status()
    except RequestException:
        # covers HTTP error statuses as well as timeouts and connection errors,
        # so a single bad URL can't crash the whole run via future.result()
        return None

    return r.text

s = requests.Session()

try:
    with futures.ThreadPoolExecutor(max_workers=5) as ex:
        # map each submitted future back to its URL
        future_to_url = {ex.submit(get_url_data, url, s): url
                         for url in urllist}

        # collect each page's HTML (or None) as the workers finish
        results = {future_to_url[future]: future.result()
                   for future in futures.as_completed(future_to_url)}
finally:
    s.close()
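
Since the question also mentions saving into SQLite, the standard-library sqlite3 module can persist the results dict once the pool has finished. A minimal sketch, assuming a simple url/html table (the file and table names are placeholders):

import sqlite3

conn = sqlite3.connect("scrape.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
        [(url, html) for url, html in results.items() if html is not None],
    )
conn.close()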

Also, you may want to check out the Scrapy framework. It scrapes data concurrently and comes with many features such as auto-throttling, rotating proxies and user agents, and you can easily integrate it with your database as well.
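
As a rough illustration of what a Scrapy spider looks like (the spider name, start URL, and CSS selectors below are placeholders, not the real site's markup):

import scrapy

class UsersSpider(scrapy.Spider):
    name = "users"
    start_urls = ["https://example.com/users"]  # placeholder URL

    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,  # Scrapy's built-in auto-throttle extension
    }

    def parse(self, response):
        # adjust these selectors to the target page's actual HTML
        for row in response.css("div.user"):
            yield {
                "name": row.css("a.username::text").get(),
                "joined": row.css("span.joined::text").get(),
            }

Running it with scrapy runspider spider.py -o users.json dumps the scraped items to a file; an item pipeline can write them to SQLite instead.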

Upvotes: 1
