Reputation: 1084
Below is a blueprint of my crawler. I thought I could speed it up with multithreading, but I can't. Often the web server of the page I'm loading is slow, and it would be nice to crawl another, faster-loading page in the meantime using multithreading. But it isn't any faster. Why?
def start_it():
    while(True):
        get_urls()

def get_urls():
    response = urllib2.urlopen(url)
    page_source = str(response.read())

pool = ThreadPool(10)
pool.map(start_it())
OK, I tested whether the threads run in parallel and they do not :/ What am I doing wrong?
def start_it():
    x = random.random()
    while(True):
        get_urls(x)

def get_urls(x):
    print(x)

pool = ThreadPool(10)
pool.map(start_it())
I know this because the output is always the same:
0.1771815430790964
0.1771815430790964
0.1771815430790964
0.1771815430790964
0.1771815430790964
0.1771815430790964
0.1771815430790964
Upvotes: 1
Views: 145
Reputation: 11358
If the code you posted actually runs, you shouldn't do pool.map(start_it()), as that calls start_it before passing the result to pool.map. Instead you must pass start_it without any (), as in pool.map(start_it). You probably need another argument as well (values to pass to start_it).
You can try the example below, which seems to work for me.
import json
import multiprocessing.pool
import time
import urllib2

def run(no):
    for n in range(3):
        f = urllib2.urlopen("http://time.jsontest.com")
        data = json.loads(f.read())
        f.close()
        print("thread %d: %s" % (no, data))
        time.sleep(1)

pool = multiprocessing.pool.ThreadPool(3)
pool.map(run, range(3))
You could also use multiprocessing.Process, e.g.:
import multiprocessing
import time
import os

def run(jobno):
    for n in range(3):
        print("job=%d pid=%d" % (jobno, os.getpid()))
        time.sleep(1)

jobs = []
for n in range(3):
    p = multiprocessing.Process(target=run, args=[n])
    jobs.append(p)

map(lambda x: x.start(), jobs)
map(lambda x: x.join(), jobs)
Example output:
job=0 pid=18347
job=1 pid=18348
job=2 pid=18349
job=0 pid=18347
job=2 pid=18349
job=1 pid=18348
job=2 pid=18349
job=0 pid=18347
job=1 pid=18348
Everything under the multiprocessing module uses processes instead of threads, which are truly parallel. Just note that there might be some issues with that (versus running them as threads under the same process).
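A minimal sketch (my own, not from the answer above) of one such issue: separate processes do not share memory, so any state a worker builds up, for example a list of crawled URLs, has to be sent back explicitly, e.g. through a multiprocessing.Queue:

import multiprocessing

results = []  # lives in the parent process

def fetch(url, queue):
    # each child process gets its own copy of the module globals
    results.append(url)  # only changes the child's copy
    queue.put(url)       # the only value the parent will actually see

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=fetch, args=("http://example.com", queue))
    p.start()
    p.join()
    print(results)      # [] -- untouched in the parent
    print(queue.get())  # http://example.com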
Upvotes: 0
Reputation: 1084
I think this way it is running truly in parallel. I experienced a significant speed-up of the crawling. Awesome ;)
import multiprocessing
import random

def worker():
    """worker function"""
    x = random.random() * 10
    x = round(x)
    while(True):
        print(x, ' Worker')

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
Upvotes: -1
Reputation: 16733
Not to digress, but asynchronous IO is also a good candidate for your problem. You can use an amazing library called asyncio, which was recently added in Python 3.4. For older versions you can use Trollius or Twisted.
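A rough sketch of what that could look like, assuming Python 3.5+ syntax and the third-party aiohttp package (neither is part of the answer above; on 3.4 you would use @asyncio.coroutine and yield from instead of async/await), with placeholder URLs:

import asyncio
import aiohttp

async def fetch(session, url):
    # while one download waits on a slow server, the others keep running
    async with session.get(url) as response:
        return await response.text()

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
        for url, page in zip(urls, pages):
            print(url, len(page))

urls = ["http://example.com", "http://example.org"]
asyncio.get_event_loop().run_until_complete(crawl(urls))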
Upvotes: 0
Reputation: 12641
You need to provide pool.map() an iterable. At the moment you're running start_it(), which basically runs all your calls one after another. I don't know what implementation of ThreadPool you are using, but you probably need to do something like:
pool.map(get_urls, list_of_urls)
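For completeness, a minimal sketch of that idea, assuming Python 2 / urllib2 as in the question and a placeholder list_of_urls:

import urllib2
from multiprocessing.pool import ThreadPool

def get_urls(url):
    # each call downloads one page; a slow server only blocks its own worker thread
    response = urllib2.urlopen(url)
    return str(response.read())

list_of_urls = ["http://example.com", "http://example.org"]
pool = ThreadPool(10)
pages = pool.map(get_urls, list_of_urls)  # spreads the URLs over 10 threads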
Upvotes: 2