U. J. V.

Reputation: 25

Improve Python scraping code with multithreading

I am writing a simple Python crawler using urllib2, BeautifulSoup, and csv (Python 2.7). I have a .csv file that stores the URL links that need to be scraped.

With the code below, I crawl a specific number from each link: the maximum number of attendees listed on the page. The crawl(url) function works properly, as does the whole script.

from bs4 import BeautifulSoup
import csv, urllib2

def crawl(url):
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"})
    response = urllib2.urlopen(request)
    readd = response.read()
    response.close()

    soup = BeautifulSoup(readd, "lxml")

    # Count the attendees listed on the page, if any
    maxx = 0
    placeholder = soup.find("div", attrs={"class": "attendees-placeholder placeholder"})
    if placeholder:
        user_list = placeholder.find("ul", "user-list")
        attendees = user_list.find_all("li")
        maxx = len(attendees)

    print maxx


urls = csv.reader(open('all_links_2017.csv'))

for url in urls:
    crawl(url[0])

Meanwhile, it's going too slow because I have around 100,000 URLs. I have tried many multithreading samples, but they were not what I expected. Is there any way to improve this code so it runs faster (i.e. multithreading, a pool, ...)?

Upvotes: 2

Views: 988

Answers (2)

pstatix

Reputation: 3848

And you've tried?:

import threading

def crawl(url, sem):
    # Block until the semaphore has a free slot
    sem.acquire()
    try:
        # Your code here
        pass
    finally:
        # All the work is done (i.e. after print maxx),
        # so free the slot for the next thread
        sem.release()

sem = threading.Semaphore(4)
threads = [threading.Thread(target=crawl, args=(url, sem)) for url in urls]

for thread in threads:
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()

Edit: Changed the first for loop to a list comprehension.

Edit: Added a threading.Semaphore() limiting method. A Semaphore is a limiter (in essence, a counter of threads) that keeps track of the number of threads running concurrently. In this case, the value is set to a maximum of 4 threads at any given time. It can also be used with the with context manager if you choose a BoundedSemaphore().
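
For illustration, a minimal sketch of that with-statement variant; the scraping work is assumed to go where the placeholder comment sits:

import threading

sem = threading.BoundedSemaphore(4)  # at most 4 concurrent workers

def crawl(url):
    # Acquired on entry, released on exit, even if the body raises
    with sem:
        pass  # your scraping code here (i.e. the body of crawl(url))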

Upvotes: 1

Roland Smith

Reputation: 43505

Use a multiprocessing.Pool. Change crawl to return maxx instead of printing it. Then use the multiprocessing.Pool.imap_unordered method.

import multiprocessing

p = multiprocessing.Pool()
urls = csv.reader(open('all_links_2017.csv'))
for value in p.imap_unordered(crawl, [u[0] for u in urls]):
    print(value)

By default, this will create as many worker processes as your CPU has cores.
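
For reference, a sketch of the question's crawl rewritten to return the count instead of printing it (the User-Agent header is shortened here for brevity):

from bs4 import BeautifulSoup
import urllib2

def crawl(url):
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response.read(), "lxml")
    response.close()

    placeholder = soup.find("div", attrs={"class": "attendees-placeholder placeholder"})
    if placeholder:
        user_list = placeholder.find("ul", "user-list")
        # Return the attendee count so imap_unordered can collect it
        return len(user_list.find_all("li"))
    return 0

Note that on Windows, the Pool creation and the loop should sit under an if __name__ == '__main__': guard so the worker processes can be spawned safely.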

Upvotes: 1
