Reputation: 21
Hi, I want to scrape some data from a website (users and their data) and then save it in a SQLite database.
I use lxml and cchardet, but that hasn't made it noticeably more efficient.
Upvotes: 2
Views: 187
Reputation: 1076
If you want to reduce the time spent in the scraping part, I suggest building a multithreaded program to make the requests. concurrent.futures
is one of the easiest ways to multithread these kinds of requests, in particular using the ThreadPoolExecutor. There is even a simple multithreaded URL request example in the documentation.
import concurrent.futures as futures
import requests
from requests.exceptions import HTTPError

urllist = ...

def get_url_data(url, session):
    # Fetch one URL; return its text, or None if the server
    # answered with an HTTP error status.
    try:
        r = session.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:
        return None
    return r.text

s = requests.Session()
try:
    # Run up to 5 requests at a time; map each future back to its URL.
    with futures.ThreadPoolExecutor(max_workers=5) as ex:
        future_to_url = {ex.submit(get_url_data, url, s): url
                         for url in urllist}
        results = {future_to_url[future]: future.result()
                   for future in futures.as_completed(future_to_url)}
finally:
    s.close()
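Since the question also asks about saving the data into SQLite, here is a minimal sketch of that last step using the standard-library sqlite3 module. It assumes a results dict shaped like the one the ThreadPoolExecutor code produces (url -> HTML text, or None on failure); the placeholder dict, the database path, and the pages table name are all illustrative, not from the original code.

```python
import sqlite3

# Placeholder standing in for the `results` dict built by the
# ThreadPoolExecutor code above (url -> HTML text, or None on failure).
results = {
    "https://example.com/a": "<html>a</html>",
    "https://example.com/b": None,  # a request that failed
}

conn = sqlite3.connect(":memory:")  # use a file path to persist the data
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
        [(url, html) for url, html in results.items() if html is not None],
    )
saved = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

Filtering out the None values keeps failed requests out of the table; you could instead store them with a status column if you want to retry them later.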
Also, you may want to check out the Python Scrapy framework. It scrapes data concurrently and comes with many features such as auto-throttling, rotating proxies, and rotating user agents, and it integrates easily with databases as well.
Upvotes: 1