Reputation: 1523
My current code creates a separate Session object for every request through the .get() method:
content_getters.py (the relevant part):
import requests


def get_page_content(link: str) -> bytes:
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; "
                             "Intel Mac OS X 10_11_6) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/61.0.3163.100 Safari/537.36"}
    # requests.get() opens (and closes) a new Session on every call
    response = requests.get(link, headers=headers)
    if response.status_code != requests.codes.ok:
        raise ConnectionError("Page", link, "returned status code",
                              response.status_code)
    return response.content


def parse_single_page(link):
    content = get_page_content(link)
    # rest of very long function
main.py:
from concurrent.futures.thread import ThreadPoolExecutor

from content_getters import get_page_content, extract_links, parse_single_page

if __name__ == "__main__":
    MAX_THREADS = 30

    # get links
    html: str = get_page_content(
        "https://www.d20pfsrd.com/bestiary/bestiary-hub/monsters-by-cr/") \
        .decode("utf-8")
    links = extract_links(html)

    num_threads = min(MAX_THREADS, len(links))
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # asynchronous, threads will return results when they finish their
        # own work
        results = [result for result
                   in executor.map(parse_single_page, links)]
The requests docs state that "if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase". I suppose that my separate calls to the .get() method create a separate Session object for each call, so reusing a single Session could be faster.
Question: Is the Session object synchronous (sequential) for all requests made with it? Will I still get asynchronous requests if I use the same Session object for all threads in concurrent.futures.thread.ThreadPoolExecutor, instead of 1 Session per thread as I'm doing now?
Upvotes: 0
Views: 2541
Reputation: 11
As per the documentation, requests.Session uses urllib3's connection pooling for its connections, and as per urllib3's documentation, that connection pooling is now thread-safe.
It probably wasn't when the question was originally posted, but judging by a GitHub comment, it has most likely been made thread-safe for good since then.
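If the pool can indeed be shared, one Session could be handed to every worker thread. Below is a minimal sketch under that assumption, and under the further assumption that the threads only call .get() and never modify the Session's headers or other settings. The names make_shared_session and fetch_all and the 10 second timeout are illustrative and not taken from the question. The adapter is mounted explicitly because the default pool keeps at most 10 connections per host, fewer than the 30 threads used in the question.

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from requests.adapters import HTTPAdapter

MAX_THREADS = 30  # same value as in the question


def make_shared_session(pool_size: int = MAX_THREADS) -> requests.Session:
    """Build one Session whose connection pool can serve pool_size threads."""
    session = requests.Session()
    # pool_maxsize: connections kept per host; pool_connections: host pools cached
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch_all(session: requests.Session, links):
    """Fetch all links concurrently through the one shared Session."""
    if not links:
        return []

    def fetch(link: str) -> bytes:
        # timeout value is an assumption for this sketch
        response = session.get(link, timeout=10)
        response.raise_for_status()
        return response.content

    num_threads = min(MAX_THREADS, len(links))
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # map() returns results in the same order as the input links
        return list(executor.map(fetch, links))
```

The requests still run concurrently: each thread blocks only on its own call, while the pool hands out separate connections.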
Upvotes: 1
Reputation: 4064
In short, Session is not thread-safe; you can check the issue discussion on GitHub.
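If you want to stay with threads anyway, one way to sidestep the thread-safety question is to give each worker thread its own Session. This is only a sketch: get_session and fetch_all are hypothetical helper names, and the 10 second timeout is an assumed value.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

# Thread-local storage: every thread sees its own `session` attribute.
_thread_local = threading.local()


def get_session() -> requests.Session:
    """Return the calling thread's Session, creating it on first use."""
    if not hasattr(_thread_local, "session"):
        _thread_local.session = requests.Session()
    return _thread_local.session


def get_page_content(link: str) -> bytes:
    # Connections are reused within a thread, never shared between threads.
    response = get_session().get(link, timeout=10)  # timeout is an assumption
    response.raise_for_status()
    return response.content


def fetch_all(links, max_threads: int = 30):
    """Fetch all links concurrently; results keep the order of `links`."""
    if not links:
        return []
    num_threads = min(max_threads, len(links))
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        return list(executor.map(get_page_content, links))
```

Each thread then reuses its own TCP connections to the host, at the cost of up to one connection per thread.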
For your case, I would highly recommend looking at asyncio and the aiohttp module, where you are free to pass a session around since everything runs in one thread. It also won't incur as much overhead as multithreading. As they say:
Use asyncio when you can, use threads when you must
See the aiohttp documentation.
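A rough sketch of what that could look like, not a drop-in replacement for the code in the question. It assumes aiohttp is installed; fetch_all and MAX_CONCURRENCY are illustrative names, and the semaphore limit simply mirrors MAX_THREADS from the question.

```python
import asyncio

import aiohttp  # third-party: pip install aiohttp

MAX_CONCURRENCY = 30  # assumed limit, mirroring MAX_THREADS in the question


async def get_page_content(session, semaphore, link: str) -> bytes:
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        async with session.get(link) as response:
            response.raise_for_status()
            return await response.read()


async def fetch_all(links):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    # One ClientSession shared by every task; all tasks run in a single thread.
    async with aiohttp.ClientSession() as session:
        tasks = [get_page_content(session, semaphore, link) for link in links]
        # gather() returns the results in the same order as `links`
        return await asyncio.gather(*tasks)
```

It would be called as asyncio.run(fetch_all(links)). Any CPU-heavy parsing done afterwards still runs in that one thread, so it may be worth keeping it separate from the download step.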
Upvotes: 2