Stackbeans

Reputation: 279

Concurrency multithreading with requests

I am trying to figure out how to make concurrent requests with multithreading while using the requests library. I want to grab the links and the total number of pages from each URL's POST request.

However, I am iterating over a very large loop, so it takes an awfully long time. What I have tried doesn't seem to make the requests concurrent, nor does it produce any output.

Here's what I have tried:

#smaller subset of my data

df = {'links': ['https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D687',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D492',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D499',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D702',
'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D6143'],
 'make': [138.0,138.0,138.0,138.0,138.0],
 'model': [687.0,492.0,499.0,702.0,6143.0],
 'country_id': [6.0,6.0,6.0,6.0,6.0]}

import json
from collections import defaultdict

import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
import threading
import gc



def get_links(url):
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    formal_data = defaultdict(list)
    for id_ in df['country_id']:
        for make in df['make']:
            for model in df['model']:
                data = {
                    'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
                    'tabs': '["t0"]'
                            }
                response = requests.post(url, headers=headers, data=data)
                test = json.loads(response.text)
                pages = round(int(test['context']['nb_results'])/27)
                if pages != 0:
                    formal_data['total_pages'].append(pages)
                    formal_data['links'].append(url)
                    print(f'You are on this link:{url}')
    return formal_data
threadLocal = threading.local()

with ThreadPool(8) as pool:
    urls = df['links']
    pool.map(get_links, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()

Upvotes: 1

Views: 3129

Answers (2)

azelcer

Reputation: 1533

Note that a more modern approach to using requests asynchronously is to use other libraries, like requests-threads.

With your approach, you connect to the various URLs in parallel, but sequentially to each URL. Consequently, you might not be taking full advantage of multithreading. Indeed, for a single URL in df['links'], you get the same results as with a single thread. The easiest way to circumvent this is to use itertools.product, which makes an iterator out of what would otherwise be a nested loop.
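For reference, a quick self-contained illustration of product (the values here are made up for the example):

from itertools import product

# product(A, B) yields the same tuples as the nested loops
# "for a in A: for b in B:", but as a single flat iterator:
print(list(product([1, 2], ['a', 'b'])))
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]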

import requests
from concurrent.futures import ThreadPoolExecutor as ThreadPool
from itertools import product

#   ... snipped df definition ...

def get_links(packed_pars):
    url, id_, make, model = packed_pars
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97","Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    data = {
        'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
        'tabs': '["t0"]'
    }
    response = requests.post(url, headers=headers, data=data)
    test = response.json()
    pages = round(int(test['context']['nb_results'])/27)
    if pages != 0:
        print(f'You are on this link:{url}, with {pages} pages')
    else:
        print("no pages")
    return url, pages


with ThreadPool(8) as pool:
    rv = pool.map(get_links, product(df['links'], df['country_id'], df['make'],
                                     df['model']))
    # This converts rv to the dict of the original post:
    formal_data = dict()
    filtered_list = [(url, pages) for url, pages in rv if pages]
    if filtered_list:
        formal_data['links'], formal_data['total_pages'] = zip(*filtered_list)
    else:  # Protect against empty answers
        formal_data['links'], formal_data['total_pages'] = [], []

As for why this is not producing any output: at least with the data provided in the question, test['context']['nb_results'] is 0 every time. It is very likely that your query returns zero items each time, even with the full dataset.

Some other comments:

  • The use of multiprocessing.pool.ThreadPool is not recommended: you should switch to concurrent.futures.ThreadPoolExecutor.
  • You are not using threadLocal at all: it can be removed. I do not know what you would use it for.
  • You are importing threading but not using it.
  • requests responses have a json() method that parses the text right away: there is no need to import json in this case.
  • It is very likely that you want to ceil instead of round the number of pages (see the short example after this list).
  • Since you are waiting for I/O, it is OK to use more threads than available cores.
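To illustrate the ceil point: with 27 results per page, round can undercount, because a partial last page still has to be fetched. A minimal sketch (the counts are made up for the example):

import math

nb_results = 28  # hypothetical: one full page plus one extra result
per_page = 27

print(round(nb_results / per_page))      # 1 -- rounds to nearest, misses the last page
print(math.ceil(nb_results / per_page))  # 2 -- the 28th result needs a second page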

Upvotes: 3

kamster

Reputation: 144

One thing to note: for programs that use web APIs and are I/O bound (here, the performance hit you are taking is waiting on requests to another machine/server/etc.), the more general approach is to use async programming. A good library for async HTTP requests is httpx (there are others as well). You'll find the interface of these libraries similar to requests, and they let you work async or sync, so it should be an easy transition. From there you will want to learn about async programming in Python: the httpx quickstart and async docs, along with other good tutorials on general Python async programming, can be found via Google.

You can see that this is the approach other Python HTTP wrapper libraries take, like asyncpraw.
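To give a flavour of the transition, here is a minimal sketch with httpx (it assumes the same headers and data dicts built in the question, and the URL list from df['links']):

import asyncio
import httpx

async def fetch_pages(client, url, headers, data):
    # same POST request as in the question, but awaited instead of blocking
    response = await client.post(url, headers=headers, data=data)
    test = response.json()
    return url, round(int(test['context']['nb_results']) / 27)

async def main(urls, headers, data):
    async with httpx.AsyncClient() as client:
        tasks = [fetch_pages(client, url, headers, data) for url in urls]
        # gather runs all the requests concurrently on a single thread
        return await asyncio.gather(*tasks)

# results = asyncio.run(main(df['links'], headers, data))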

Just a quick note on why async is nice vs. multiprocessing:

  1. async essentially allows a single process/thread to execute other parts of the program while other parts are waiting on output, so it basically feels as if all of the code is executing in parallel
  2. multiprocessing is actually kicking off separate processes (I am paraphrasing a bit, but that's the gist) and likely won't get the same performance gains for I/O-bound work as you would with async.

Upvotes: 1
