Kiran

Reputation: 8518

Handling multiple HTTP requests in Python

I am mining data from a website by scraping it in Python. I am using the requests package to send the parameters.

Here is the code snippet in Python:

import requests

# url, paramList, get_post_data and parse_page are defined elsewhere

def get_url_data(param):
    post_data = get_post_data(param)

    headers = {}
    headers["Content-Type"] = "text/xml; charset=UTF-8"
    headers["Content-Length"] = str(len(post_data))  # header values must be strings
    headers["Connection"] = 'Keep-Alive'
    headers["Cache-Control"] = 'no-cache'

    page = requests.post(url, data=post_data, headers=headers, timeout=10)
    data = parse_page(page.content)
    return data

for param in paramList:
    data = get_url_data(param)

The variable paramList is a list of more than 1000 elements, and the endpoint URL stays the same. I was wondering if there is a better and faster way to do this?

Thanks

Upvotes: 1

Views: 2476

Answers (2)

James W.

Reputation: 3055

I need to make 1000 POST requests to the same domain; I was wondering if there is a better and faster way to do this?

It depends. If it's a static asset, or a servlet whose behaviour you know, and the same parameters return the same response each time, you can implement an LRU cache or some other caching mechanism. If not, 1K POST requests to a servlet are not a problem just because they all go to the same domain.
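
For example, if the responses really are deterministic per param, a minimal caching sketch (the cache size is an assumption; get_url_data and paramList are the names from the question) could be:

from functools import lru_cache

@lru_cache(maxsize=1024)  # keep the last 1024 distinct params in memory
def get_url_data_cached(param):
    # only falls through to the real HTTP request on a cache miss
    return get_url_data(param)

results = [get_url_data_cached(param) for param in paramList]

Note that lru_cache requires param to be hashable.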

There is an answer here that uses multiprocessing's ThreadPool interface, which actually runs 15 threads inside the main process. Does it run on a 15-core machine? A core can only run one thread at a time (hyper-threading aside: does it run on 8 hyper-threaded cores?).

The ThreadPool interface lives inside a library with a misleading name, multiprocessing, while Python also has a separate threading module, which is confusing. Let's benchmark some lower-level code:

import psutil
from multiprocessing.pool import ThreadPool
from time import sleep  

def get_url_data(param):
    print(param)  # just for convenience
    sleep(1)  # lets assume it will take one second each time

if __name__ == '__main__':
    paramList = [i for i in range(100)]  # 100 urls
    pool = ThreadPool(psutil.cpu_count())  # each core can run one thread (hyper.. not now)
    pool.map(get_url_data, paramList)  # splitting the jobs
    pool.close()

The code above will use the main process with 4 threads in my case, because my laptop has 4 CPUs. Benchmark result:

$ time python3_5_2 test.py
real    0m28.127s
user    0m0.080s
sys     0m0.012s

Let's try spawning processes with multiprocessing:

import psutil
import multiprocessing
from time import sleep
import numpy

def get_url_data(urls):
    for url in urls:
        print(url)
        sleep(1)  # lets assume it will take one second each time

if __name__ == "__main__":
    jobs = []

    # Split URLs into chunks as number of CPUs
    chunks = numpy.array_split(range(100), psutil.cpu_count())

    # Pass each chunk into process
    for url_chunk in chunks:
        jobs.append(multiprocessing.Process(target=get_url_data, args=(url_chunk, )))

    # Start the processes
    for j in jobs:
        j.start()

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()

Benchmark result: about 3 seconds less.

$ time python3_5_2 test2.py
real    0m25.208s
user    0m0.260s
sys     0m0.216s

If you execute ps aux | grep "test2.py" you will see 5 processes, because one is the main process which manages the others.

There are some drawbacks:

  • You did not explain in depth what your code does, but if you are doing work which needs to be synchronized, you need to know that multiprocessing is NOT thread safe (see the sketch after this list).

  • Spawning extra processes introduces I/O overhead, as data has to be shuffled around between processes.

  • Assuming the data is restricted to each process, it is possible to gain a significant speedup, but be aware of Amdahl's Law.
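
To illustrate the synchronization point above, here is a minimal sketch of guarding a shared resource with a multiprocessing.Lock (the results.txt file and the two-chunk split are assumptions for illustration, not from the question):

import multiprocessing
from time import sleep

def get_url_data(lock, urls):
    for url in urls:
        sleep(1)  # placeholder for the real request/parse work
        # only one process at a time may append to the shared file
        with lock:
            with open("results.txt", "a") as f:
                f.write("fetched %s\n" % url)

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    chunks = [range(0, 50), range(50, 100)]  # two chunks for two workers
    jobs = [multiprocessing.Process(target=get_url_data, args=(lock, chunk))
            for chunk in chunks]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

With a Pool you would hand the lock to the workers through an initializer instead, since it cannot be pickled into pool.map arguments.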

If you reveal what your code does afterwards (save it to a file? a database? stdout?) it will be easier to give a better answer/direction. A few ideas come to mind, like immutable infrastructure with Bash or Java to handle synchronization, or, if it is a memory-bound issue, an object pool to process the JSON responses. It might even be a job for fault-tolerant Elixir.

Upvotes: 2

Evya

Reputation: 2375

As there is a significant amount of network I/O involved, threading should improve the overall performance significantly.
You can try using a ThreadPool, and you should test and tweak the number of threads to find the one best suited to the situation, i.e. the one that gives the highest overall throughput.

from multiprocessing.pool import ThreadPool

# Remove 'for param in paramList' iteration

def get_url_data(param):    
    # Rest of code here

if __name__ == '__main__':
    pool = ThreadPool(15)
    pool.map(get_url_data, paramList) # Will split the load between the threads nicely
    pool.close()
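
Since the best pool size depends on the server and the network, one rough way to pick it (a sketch; the candidate sizes and the 100-element sample are assumptions, while get_url_data and paramList come from the question) is to time a small slice with a few different sizes:

import time
from multiprocessing.pool import ThreadPool

sample = paramList[:100]  # benchmark on a small slice first

for size in (5, 10, 15, 25, 50):
    pool = ThreadPool(size)
    start = time.time()
    pool.map(get_url_data, sample)
    pool.close()
    pool.join()
    print("%2d threads: %.1fs" % (size, time.time() - start))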

Upvotes: 2
