Reputation: 8518
I am mining data from a website through data scraping in Python, using the requests
package to send the parameters.
Here is the code snippet in Python:
import requests

def get_url_data(param):
    post_data = get_post_data(param)
    headers = {}
    headers["Content-Type"] = "text/xml; charset=UTF-8"
    headers["Content-Length"] = str(len(post_data))  # header values must be strings
    headers["Connection"] = 'Keep-Alive'
    headers["Cache-Control"] = 'no-cache'
    page = requests.post(url, data=post_data, headers=headers, timeout=10)
    data = parse_page(page.content)
    return data

for param in paramList:
    data = get_url_data(param)
The variable paramList is a list of more than 1000 elements, and the endpoint url
remains the same. I was wondering if there is a better and faster way to do this?
Thanks
Upvotes: 1
Views: 2476
Reputation: 3055
I need to make 1000 POST requests to the same domain, I was wondering if there is a better and faster way to do this?
It depends. If it's a static asset or a servlet whose behavior you know, and the same parameters return the same response each time, you can implement an LRU or some other caching mechanism (a sketch follows below). If not, 1K POST requests to some servlet doesn't matter even if they share the same domain.
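As a minimal sketch of that caching idea, assuming identical params always yield identical responses, that param is hashable, and that paramList actually contains repeats, the stdlib's functools.lru_cache can wrap the question's get_url_data:

from functools import lru_cache

@lru_cache(maxsize=1024)  # keep up to 1024 distinct param -> response mappings
def get_url_data_cached(param):
    # Delegates to get_url_data() from the question; only the first
    # call per distinct param hits the network.
    return get_url_data(param)

# for param in paramList:
#     data = get_url_data_cached(param)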
There is an answer that uses multiprocessing with the ThreadPool interface, which actually uses the main process with 15 threads. Does it run on a 15-core machine? A core can only run one thread at a time (hyper-threaded ones aside; does it run on 8 hyper-cores?).
The ThreadPool interface lives inside a library with a misleading name, multiprocessing, and Python also ships a separate threading module, which is confusing as f#ck. Let's benchmark some lower-level code:
import psutil
from multiprocessing.pool import ThreadPool
from time import sleep

def get_url_data(param):
    print(param)  # just for convenience
    sleep(1)      # let's assume each request takes one second

if __name__ == '__main__':
    paramList = [i for i in range(100)]  # 100 urls
    pool = ThreadPool(psutil.cpu_count())  # each core can run one thread (hyper.. not now)
    pool.map(get_url_data, paramList)  # splitting the jobs
    pool.close()
The code above will use the main process with 4 threads in my case, because my laptop has 4 CPUs. Benchmark result:
$ time python3_5_2 test.py
real 0m28.127s
user 0m0.080s
sys 0m0.012s
Let's try spawning processes with multiprocessing:
import psutil
import multiprocessing
from time import sleep
import numpy

def get_url_data(urls):
    for url in urls:
        print(url)
        sleep(1)  # let's assume each request takes one second

if __name__ == "__main__":
    jobs = []
    # Split URLs into as many chunks as there are CPUs
    chunks = numpy.array_split(range(100), psutil.cpu_count())
    # Pass each chunk into its own process
    for url_chunk in chunks:
        jobs.append(multiprocessing.Process(target=get_url_data, args=(url_chunk, )))
    # Start the processes
    for j in jobs:
        j.start()
    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
Benchmark result: about 3 seconds less:
$ time python3_5_2 test2.py
real 0m25.208s
user 0m0.260s
sys 0m0.216s
If you execute ps aux | grep "test2.py"
you will see 5 processes, because one is the main process which manages the others.
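If you'd rather verify the count from inside the script, a minimal sketch using the stdlib's multiprocessing.active_children() (a toy job, not the URL code above):

import multiprocessing
from time import sleep

def work():
    sleep(1)  # keep the child alive long enough to be counted

if __name__ == "__main__":
    jobs = [multiprocessing.Process(target=work) for _ in range(4)]
    for j in jobs:
        j.start()
    # active_children() returns only the live child processes; the main
    # process is not in the list, so this prints 4 while ps shows 5
    # (4 children plus the main process).
    print(len(multiprocessing.active_children()))
    for j in jobs:
        j.join()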
There are some drawbacks:
You did not explain in depth what your code is doing, but if you are doing work that needs to be synchronized, you need to know that multiprocessing is NOT thread safe; shared state must be guarded explicitly (see the Lock sketch after this list).
Spawning extra processes introduces I/O overhead, as data has to be shuffled around between processes.
Assuming the data is restricted to each process, it is possible to gain significant speedup; be aware of Amdahl's Law.
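To make the synchronization point concrete, here is a minimal sketch (an illustration, not taken from the question) guarding a shared counter with multiprocessing.Value and multiprocessing.Lock, since += on shared state is not atomic:

import multiprocessing

def count_up(counter, lock):
    for _ in range(1000):
        with lock:  # serialize access to the shared value
            counter.value += 1

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    counter = multiprocessing.Value("i", 0)  # shared int across processes
    jobs = [multiprocessing.Process(target=count_up, args=(counter, lock))
            for _ in range(4)]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    print(counter.value)  # reliably 4000 with the lock, often less without it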
If you reveal what your code does afterwards (save it into a file? a database? stdout?), it will be easier to give a better answer/direction. A few ideas come to mind: immutable infrastructure with Bash or Java to handle synchronization, or, if it is a memory-bound issue, an object pool to process the JSON responses. It might even be a job for fault-tolerant Elixir.
Upvotes: 2
Reputation: 2375
As there is a significant amount of network I/O involved, threading should improve the overall performance significantly.
You can try using a ThreadPool, and you should test and tweak the number of threads to find the one best suited to the situation, i.e. the one that shows the highest overall performance.
from multiprocessing.pool import ThreadPool

# Remove 'for param in paramList' iteration
def get_url_data(param):
    # Rest of code here
    pass

if __name__ == '__main__':
    pool = ThreadPool(15)
    pool.map(get_url_data, paramList)  # Will split the load between the threads nicely
    pool.close()
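One more thing worth trying on top of this (a sketch, not tested against the asker's endpoint): since the endpoint url never changes, giving each worker thread its own requests.Session lets it reuse the underlying TCP connection instead of opening a new one per POST. Here url, paramList, get_post_data and parse_page are the names from the question:

import threading
from multiprocessing.pool import ThreadPool
import requests

thread_local = threading.local()

def get_session():
    # Lazily create one Session per worker thread; a Session keeps the
    # TCP connection alive across requests to the same host.
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
        thread_local.session.headers.update(
            {"Content-Type": "text/xml; charset=UTF-8"})
    return thread_local.session

def get_url_data(param):
    post_data = get_post_data(param)
    page = get_session().post(url, data=post_data, timeout=10)
    return parse_page(page.content)

if __name__ == '__main__':
    pool = ThreadPool(15)
    results = pool.map(get_url_data, paramList)
    pool.close()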
Upvotes: 2