Stackbeans

Reputation: 279

Concurrency multithreading with requests

I am trying to figure out how to make concurrent requests with multithreading while using the requests library. I want to grab the links and the total number of pages from each URL's POST request.

However, I am iterating over a very large loop, so it takes an awfully long time. What I have tried doesn't seem to make the requests concurrent, nor does it produce any output.

Here's what I have tried:

#smaller subset of my data

df = {'links': ['https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D687',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D492',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D499',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D702',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D6143'],
 'make': [138.0,138.0,138.0,138.0,138.0],
 'model': [687.0,492.0,499.0,702.0,6143.0],
 'country_id': [6.0,6.0,6.0,6.0,6.0]}

import json
from collections import defaultdict

import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
import threading
import gc



def get_links(url):
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    formal_data = defaultdict(list)
    for id_ in df['country_id']:
        for make in df['make']:
            for model in df['model']:
                data = {
                    'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
                    'tabs': '["t0"]'
                            }
                response = requests.post(url, headers=headers, data=data)
                test = json.loads(response.text)
                pages = round(int(test['context']['nb_results'])/27)
                if pages != 0:
                    formal_data['total_pages'].append(pages)
                    formal_data['links'].append(url)
                    print(f'You are on this link:{url}')
    return formal_data
threadLocal = threading.local()

with ThreadPool(8) as pool:
    urls = df['links']
    pool.map(get_links, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()

Upvotes: 1

Views: 3129

Answers (2)

azelcer

Reputation: 1533

Note that a more modern approach to using requests asynchronously is to use other libraries, like requests-threads.

With your approach, you connect to the various URLs in parallel, but sequentially to each URL. Consequently, you might not be taking full advantage of multithreading. Indeed, for a single URL in df['links'], you get the same results as with a single thread. The easiest way to circumvent this is to use itertools.product, which makes an iterator out of what would otherwise be a nested loop.
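For reference, a quick self-contained illustration of product (the values here are made up for the example):

from itertools import product

# product(A, B) yields the same tuples as the nested loops
# "for a in A: for b in B:", but as a single flat iterator:
print(list(product([1, 2], ['a', 'b'])))
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]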

import requests
from concurrent.futures import ThreadPoolExecutor as ThreadPool
from itertools import product

#   ... snipped df definition ...

def get_links(packed_pars):
    url, id_, make, model = packed_pars
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97","Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    data = {
        'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
        'tabs': '["t0"]'
    }
    response = requests.post(url, headers=headers, data=data)
    test = response.json()
    pages = round(int(test['context']['nb_results'])/27)
    if pages != 0:
        print(f'You are on this link:{url}, with {pages} pages')
    else:
        print("no pages")
    return url, pages


with ThreadPool(8) as pool:
    rv = pool.map(get_links, product(df['links'], df['country_id'], df['make'],
                                     df['model']))
    # This converts rv to the dict of the original post:
    formal_data = dict()
    filtered_list = [(url, pages) for url, pages in rv if pages]
    if filtered_list:
        formal_data['links'], formal_data['total_pages'] = zip(*filtered_list)
    else:  # Protect against empty answers
        formal_data['links'], formal_data['total_pages'] = [], []

As for why this is not producing any output: at least with the data provided in the question, test['context']['nb_results'] is 0 every time. It is very likely that your query returns zero items each time, even with the full dataset.

Some other comments:

  • The use of multiprocessing.pool.ThreadPool is not recommended: you should switch to concurrent.futures.ThreadPoolExecutor.
  • You are not using threadLocal at all: it can be removed. I do not know what you would use it for.
  • You are importing threading but not using it.
  • requests responses have a json() method that parses the text right away: there is no need to import json in this case.
  • It is very likely that you want to ceil instead of round the number of pages (see the short example after this list).
  • Since you are waiting for I/O, it is OK to use more threads than available cores.
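To illustrate the ceil point: with 27 results per page, round can undercount, because a partial last page still has to be fetched. A minimal sketch (the counts are made up for the example):

import math

nb_results = 28  # hypothetical: one full page plus one extra result
per_page = 27

print(round(nb_results / per_page))      # 1 -- rounds to nearest, misses the last page
print(math.ceil(nb_results / per_page))  # 2 -- the 28th result needs a second page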

Upvotes: 3

kamster

Reputation: 144

One thing to note: for programs that use web APIs and are I/O bound (here, the performance hit you are taking is waiting on requests to another machine/server/etc.), the more general approach is to use async programming. A good library for async HTTP requests is httpx (there are others as well). You'll find the interface of these libraries similar to requests, and they let you work async or sync, so it should be an easy transition. From there you will want to learn about async programming in Python: the httpx quickstart and async docs, along with other good tutorials on general Python async programming, can be found via Google.

You can see that this is the approach other Python HTTP wrapper libraries take, like asyncpraw.
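To give a flavour of the transition, here is a minimal sketch with httpx (it assumes the same headers and data dicts built in the question, and the URL list from df['links']):

import asyncio
import httpx

async def fetch_pages(client, url, headers, data):
    # same POST request as in the question, but awaited instead of blocking
    response = await client.post(url, headers=headers, data=data)
    test = response.json()
    return url, round(int(test['context']['nb_results']) / 27)

async def main(urls, headers, data):
    async with httpx.AsyncClient() as client:
        tasks = [fetch_pages(client, url, headers, data) for url in urls]
        # gather runs all the requests concurrently on a single thread
        return await asyncio.gather(*tasks)

# results = asyncio.run(main(df['links'], headers, data))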

Just a quick note on why async is nice vs. multiprocessing:

  1. async essentially allows a single process/thread to execute other parts of the program while other parts are waiting on output, so it basically feels as if all of the code is executing in parallel
  2. multiprocessing is actually kicking off separate processes (I am paraphrasing a bit, but that's the gist) and likely won't get the same performance gains for I/O-bound work as you would with async.

Upvotes: 1
