Flex Texmex

Reputation: 1084

Why can I not speed up crawling with multithreading in python?

Below is a blueprint of my crawler. I thought I could speed it up with multithreading: when a slow web server delays one page, another thread could crawl a faster page in the meantime. But it isn't any faster. Why?

def start_it():
    while(True):
        get_urls()

def get_urls():
    response = urllib2.urlopen(url)
    page_source = str(response.read())

pool = ThreadPool(10)

pool.map(start_it())

OK, I tested whether the threads run in parallel, and they don't :/ What am I doing wrong?

def start_it():

    x = random.random()
    while(True):
        get_urls(x)

def get_urls(x):
    print(x)

pool = ThreadPool(10)

pool.map(start_it())

I know this because the output is always the same:

0.1771815430790964
0.1771815430790964
0.1771815430790964
0.1771815430790964
0.1771815430790964
0.1771815430790964
0.1771815430790964

Upvotes: 1

Views: 145

Answers (4)

csl

Reputation: 11358

If the code you posted actually runs, you shouldn't do pool.map(start_it()), as that calls start_it before passing the result to pool.map. Instead you must pass start_it without any (), as in pool.map(start_it). You probably need another argument as well (values to pass to start_it).

You can try the example below, which seems to work for me.

import json
import multiprocessing.pool
import time
import urllib2

def run(no):
    for n in range(3):
        f = urllib2.urlopen("http://time.jsontest.com")
        data = json.loads(f.read())
        f.close()
        print("thread %d: %s" % (no, data))
        time.sleep(1)

pool = multiprocessing.pool.ThreadPool(3)
pool.map(run, range(3))

You could also use multiprocessing.Process, e.g.:

import multiprocessing
import time
import os

def run(jobno):
    for n in range(3):
        print("job=%d pid=%d" % (jobno, os.getpid()))
        time.sleep(1)

jobs = []
for n in range(3):
    p = multiprocessing.Process(target=run, args=[n])
    jobs.append(p)

map(lambda x: x.start(), jobs)
map(lambda x: x.join(), jobs)

Example output:

job=0 pid=18347
job=1 pid=18348
job=2 pid=18349
job=0 pid=18347
job=2 pid=18349
job=1 pid=18348
job=2 pid=18349
job=0 pid=18347
job=1 pid=18348

Everything under the multiprocessing module uses processes instead of threads, so the work really does run in parallel. Just note that there can be some issues with that (versus running them as threads within the same process).
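One such issue is that processes don't share memory, so a crawler has to pass its results back explicitly. Here is a minimal sketch of that (the fetch function, the Queue usage and the example URLs are only illustrative, not part of your code):

import multiprocessing

def fetch(url, results):
    # in a thread this could just append to a shared list; in a separate
    # process the Queue does the inter-process communication for us
    results.put((url, "page source for " + url))  # stand-in for a real download

if __name__ == '__main__':
    results = multiprocessing.Queue()
    urls = ["http://example.com", "http://example.org"]
    jobs = [multiprocessing.Process(target=fetch, args=(u, results)) for u in urls]
    for p in jobs:
        p.start()
    for _ in jobs:
        print(results.get())  # one result per job
    for p in jobs:
        p.join()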

Upvotes: 0

Flex Texmex

Reputation: 1084

I think this way it is running truly in parallel. I experienced a significant speed-up of the crawling. Awesome ;)

import multiprocessing
import random

def worker():
    """worker function"""
    # each process picks its own number and keeps printing it,
    # standing in for the real crawl loop
    x = random.random()*10
    x = round(x)
    while(True):
        print(x , ' Worker')

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()

Upvotes: -1

hspandher

Reputation: 16733

Not to digress, but asynchronous I/O is also a good candidate for your problem. You can use an amazing library called asyncio, which was recently added in Python 3.4. For older versions you can use trollius or Twisted.
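A minimal sketch of what that could look like with asyncio on Python 3.4 (coroutine-style syntax; fetch, fetch_all and the example URLs are illustrative names, not from this answer). The blocking urlopen call is pushed onto the default executor so the event loop can keep other fetches moving:

import asyncio
import urllib.request

@asyncio.coroutine
def fetch(loop, url):
    # urlopen blocks, so run it in the default thread-pool executor
    response = yield from loop.run_in_executor(None, urllib.request.urlopen, url)
    return url, response.read()

@asyncio.coroutine
def fetch_all(loop, urls):
    # schedule all fetches and wait for them to finish concurrently
    tasks = [fetch(loop, u) for u in urls]
    return (yield from asyncio.gather(*tasks))

urls = ["http://example.com", "http://example.org"]
loop = asyncio.get_event_loop()
for url, body in loop.run_until_complete(fetch_all(loop, urls)):
    print(url, len(body))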

Upvotes: 0

scytale

Reputation: 12641

You need to provide pool.map() with an iterable.

At the moment you're calling start_it() yourself, which simply runs all your calls one after another. I don't know which implementation of ThreadPool you are using, but you probably need to do something like:

pool.map(get_urls, list_of_urls)
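A minimal sketch of that restructuring, assuming Python 2 / urllib2 as in your question (list_of_urls and the example addresses are placeholders for the pages you actually want to crawl):

from multiprocessing.pool import ThreadPool
import urllib2

def get_urls(url):
    # each call fetches exactly one page; map() hands a different url to each thread
    response = urllib2.urlopen(url)
    return url, str(response.read())

list_of_urls = ["http://example.com", "http://example.org"]

pool = ThreadPool(10)
for url, page_source in pool.map(get_urls, list_of_urls):
    print(url, len(page_source))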

Upvotes: 2
