U. J. V.

Reputation: 25

Improve Python scraping code with multithreading

I am writing a simple Python crawler using urllib2, BeautifulSoup, and csv (Python 2.7). I have a .csv file that stores the URL links that need to be scraped.

With the code below, I crawl a specific number from each link: the maximum number of attendees listed on the page. The crawl(url) function works properly, as does the whole script.

from bs4 import BeautifulSoup
import csv, urllib2

def crawl(url):
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"})
    response = urllib2.urlopen(request)
    readd = response.read()
    response.close()

    soup = BeautifulSoup(readd, "lxml")

    # Count the attendees listed on the page, if any
    maxx = 0
    placeholder = soup.find("div", attrs={"class": "attendees-placeholder placeholder"})
    if placeholder:
        user_list = placeholder.find("ul", "user-list")
        attendees = user_list.find_all("li")
        maxx = len(attendees)

    print maxx


urls = csv.reader(open('all_links_2017.csv'))

for url in urls:
    crawl(url[0])

Meanwhile, it's going too slow because I have around 100,000 URLs. I have tried many multithreading samples, but they were not what I expected. Is there any way to improve this code so it runs faster (i.e. multithreading, a pool, ...)?

Upvotes: 2

Views: 988

Answers (2)

pstatix

Reputation: 3848

And you've tried?:

import threading

def crawl(url, sem):
    # Block until the semaphore has a free slot
    sem.acquire()
    try:
        # Your code here
        pass
    finally:
        # All the work is done (i.e. after print maxx),
        # so free the slot for the next thread
        sem.release()

sem = threading.Semaphore(4)
threads = [threading.Thread(target=crawl, args=(url, sem)) for url in urls]

for thread in threads:
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()

Edit: Changed the first for loop to a list comprehension.

Edit: Added a threading.Semaphore() limiting method. A Semaphore is a limiter (in essence, a counter of threads) that keeps track of the number of threads running concurrently. In this case, the value is set to a maximum of 4 threads at any given time. It can also be used with the with context manager if you choose a BoundedSemaphore().
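
For illustration, a minimal sketch of that with-statement variant; the scraping work is assumed to go where the placeholder comment sits:

import threading

sem = threading.BoundedSemaphore(4)  # at most 4 concurrent workers

def crawl(url):
    # Acquired on entry, released on exit, even if the body raises
    with sem:
        pass  # your scraping code here (i.e. the body of crawl(url))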

Upvotes: 1

Roland Smith

Reputation: 43505

Use a multiprocessing.Pool. Change crawl to return maxx instead of printing it. Then use the multiprocessing.Pool.imap_unordered method.

import multiprocessing

p = multiprocessing.Pool()
urls = csv.reader(open('all_links_2017.csv'))
for value in p.imap_unordered(crawl, [u[0] for u in urls]):
    print(value)

By default, this will create as many worker processes as your CPU has cores.
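
For reference, a sketch of the question's crawl rewritten to return the count instead of printing it (the User-Agent header is shortened here for brevity):

from bs4 import BeautifulSoup
import urllib2

def crawl(url):
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response.read(), "lxml")
    response.close()

    placeholder = soup.find("div", attrs={"class": "attendees-placeholder placeholder"})
    if placeholder:
        user_list = placeholder.find("ul", "user-list")
        # Return the attendee count so imap_unordered can collect it
        return len(user_list.find_all("li"))
    return 0

Note that on Windows, the Pool creation and the loop should sit under an if __name__ == '__main__': guard so the worker processes can be spawned safely.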

Upvotes: 1
