Reputation: 279
I am trying to figure out how to make concurrent requests with multithreading while using the requests library. I want to grab the links and the total page count from each URL's POST request.
However, I am iterating over a very large loop, so it takes an awfully long time. What I have tried doesn't seem to make the requests concurrent, nor does it produce any output.
Here's what I have tried:
#smaller subset of my data
df = {'links': ['https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D687',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D492',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D499',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D702',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D6143'],
'make': [138.0,138.0,138.0,138.0,138.0],
'model': [687.0,492.0,499.0,702.0,6143.0],
'country_id': [6.0,6.0,6.0,6.0,6.0]}
import json
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
from multiprocessing.pool import ThreadPool
import threading
import gc

def get_links(url):
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    formal_data = defaultdict(list)
    for id_ in df['country_id']:
        for make in df['make']:
            for model in df['model']:
                data = {
                    'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
                    'tabs': '["t0"]'
                }
                response = requests.post(url, headers=headers, data=data)
                test = json.loads(response.text)
                # 27 results per page
                pages = round(int(test['context']['nb_results'])/27)
                if pages != 0:
                    formal_data['total_pages'].append(pages)
                    formal_data['links'].append(url)
                    print(f'You are on this link:{url}')
    return formal_data

threadLocal = threading.local()

with ThreadPool(8) as pool:
    urls = df['links']
    pool.map(get_links, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()
Upvotes: 1
Views: 3129
Reputation: 1533
Note that a more modern approach to using requests asynchronously is to use other libraries, like requests-threads.
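For illustration, a minimal sketch adapted from the requests-threads README; treat the exact API (AsyncSession, its n argument, session.run) as an assumption to verify against the library's documentation, and note that the post call is assumed to mirror requests.Session.post:

from requests_threads import AsyncSession

# n is the number of worker threads backing the session (as in the README example)
session = AsyncSession(n=8)

async def _main():
    responses = []
    for url in df['links']:
        # awaiting lets the other requests proceed while this one is in flight
        responses.append(await session.post(url))
    print(responses)

if __name__ == '__main__':
    # session.run drives the coroutine to completion on the Twisted reactor
    session.run(_main)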
With your approach, you connect to the various URLs in parallel, but sequentially within each URL. Consequently, you might not be taking full advantage of multithreading. Indeed, for a single URL in df['links'], you get the same results as with a single thread. The easiest way to circumvent this is to use itertools.product, which makes an iterator of what would otherwise be a nested loop.
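A quick demonstration of what product does:

from itertools import product

# One flat iterator of tuples, equivalent to two nested for-loops:
print(list(product([1, 2], ['a', 'b'])))
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

Applied to your case, each tuple becomes one unit of work for the pool: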
import requests
from concurrent.futures import ThreadPoolExecutor as ThreadPool
from itertools import product

# ... snipped df definition ...

def get_links(packed_pars):
    url, id_, make, model = packed_pars
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97","Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    data = {
        'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
        'tabs': '["t0"]'
    }
    response = requests.post(url, headers=headers, data=data)
    test = response.json()
    pages = round(int(test['context']['nb_results'])/27)
    if pages != 0:
        print(f'You are on this link:{url}, with {pages} pages')
    else:
        print("no pages")
    return url, pages

with ThreadPool(8) as pool:
    rv = pool.map(get_links, product(df['links'], df['country_id'], df['make'],
                                     df['model']))

# This converts rv to the dict of the original post:
formal_data = dict()
filtered_list = [(url, pages) for url, pages in rv if pages]
if filtered_list:
    formal_data['links'], formal_data['total_pages'] = zip(*filtered_list)
else:  # Protect against empty answers
    formal_data['links'], formal_data['total_pages'] = [], []
As for why this is not producing any output: at least with the data provided in the question, test['context']['nb_results'] is 0 every time. It is very likely that your query returns zero items each time, even with the full dataset.
Some other comments:
- multiprocessing.pool.ThreadPool is not recommended: you should switch to concurrent.futures.ThreadPoolExecutor.
- You are not using threadLocal at all: it can be removed. I do not know what you would use it for.
- You are importing threading but not using it.
- requests responses have a json method that parses the text right away: there is no need for importing json in this case.
- You probably want to ceil instead of round the number of pages, as the snippet below shows.
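For example, with 27 results per page (the divisor used in the code above):

import math

nb_results = 55                    # e.g. the value of test['context']['nb_results']
print(round(nb_results / 27))      # 2 -- drops the partial final page
print(math.ceil(nb_results / 27))  # 3 -- counts the partial final page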
Upvotes: 3
Reputation: 144
One thing to note: for programs that call web APIs and are I/O bound (here, the performance hit comes from waiting on requests to another machine/server), the more general approach is async programming. A good library for async HTTP requests is httpx (there are others as well). You'll find the interface of these libraries similar to requests, and they let you work either async or sync, so the transition should be easy. From there you will want to learn about async programming in Python; the httpx quickstart and async guides, along with other good tutorials on general Python async programming, are easy to find via Google. You can see that this is the approach other Python HTTP wrapper libraries take, such as asyncpraw. A minimal httpx sketch follows below.
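A minimal sketch of the httpx approach, assuming the same df['links'] as above; the headers and data payloads from the question would be passed to client.post the same way as with requests:

import asyncio
import httpx

async def fetch(client, url):
    # each coroutine awaits its own response, so the requests overlap in time
    response = await client.post(url)
    return url, response.status_code

async def main(urls):
    async with httpx.AsyncClient() as client:
        # gather schedules all the requests concurrently on the event loop
        return await asyncio.gather(*(fetch(client, url) for url in urls))

results = asyncio.run(main(df['links']))
print(results)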
Just a quick note on why async is nice compared with multiprocessing:
Upvotes: 1