Reputation: 25
I am writing a simple Python crawler using urllib2, BeautifulSoup and csv (Python 2.7). I have a .csv file that stores the URL links which need to be scraped.
With the code below I crawl each of those links and find the maximum number of attendees listed on the page; the crawl(url) function works properly, as does the rest of the code.
from bs4 import BeautifulSoup
import json, csv, urllib2, urllib, re, time, lxml

def crawl(url):
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"})
    response = urllib2.urlopen(request)
    readd = response.read()
    soup = BeautifulSoup(readd, "lxml")
    response.close()
    maxx = 0
    exists = soup.find("div", attrs={"class": "attendees-placeholder placeholder"})
    if exists:
        nmb = exists.find("ul", "user-list")
        numbe = nmb.find_all("li")
        number = len(numbe)
        if number > maxx:
            maxx = number
    else:
        number = 0
    print maxx
urls = csv.reader(open('all_links_2017.csv'))
for url in urls:
    crawl(url[0])
Meanwhile, it is going far too slowly because I have around 100,000 URLs. I have tried several multithreading samples, but none of them worked the way I expected. Is there any way to improve this code so it runs faster (e.g. multithreading, a pool...)?
Upvotes: 2
Views: 988
Reputation: 3848
And have you tried this?:
import threading

def crawl(url, sem):
    # The semaphore grabs a slot; this blocks until one of the 4 slots is free
    sem.acquire()
    # Your code here
    .
    .
    .
    # All the work is done (i.e. after print maxx)
    sem.release()

sem = threading.Semaphore(4)
threads = [threading.Thread(target=crawl, args=(url, sem)) for url in urls]
for thread in threads:
    thread.start()
Edit: Changed the first for loop to a list comprehension.
Edit: Added a threading.Semaphore() to limit concurrency. A semaphore is a limiter (in essence, a counter of threads) that keeps track of how many threads are running concurrently. In this case, the value is set to a maximum of 4 threads at any given time. It can also be used with the with context manager if you choose a BoundedSemaphore().
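For completeness, a minimal sketch of that with-based variant; the placeholder urls list and the empty crawl() body are just stand-ins for the scraping code from the question:

import threading

sem = threading.BoundedSemaphore(4)   # at most 4 crawls may run at once

def crawl(url):
    # entering the with block acquires the semaphore; leaving it
    # (normally or through an exception) releases it again
    with sem:
        pass  # the scraping code from the question would go here

urls = ["http://example.com/1", "http://example.com/2"]  # placeholder list
threads = [threading.Thread(target=crawl, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()   # wait until every crawl has finished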
Upvotes: 1
Reputation: 43505
Use a multiprocessing.Pool. Change crawl to return maxx instead of printing it, then use the multiprocessing.Pool.imap_unordered method.
p = multiprocessing.Pool()
urls = csv.reader(open('all_links_2017.csv'))
for value in p.imap_unordered(crawl, [u[0] for u in urls]):
    print(value)
By default, this will create as many worker processes as your CPU has cores.
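Putting the pieces together, a rough (untested) sketch of what the whole script could look like; the selectors and file name come from the question, and the shortened User-Agent string is just a placeholder:

from bs4 import BeautifulSoup
import csv
import multiprocessing
import urllib2

def crawl(url):
    # same logic as the crawl() in the question, but the attendee
    # count is returned instead of printed
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response.read(), "lxml")
    response.close()
    exists = soup.find("div", attrs={"class": "attendees-placeholder placeholder"})
    if not exists:
        return 0
    return len(exists.find("ul", "user-list").find_all("li"))

if __name__ == '__main__':
    # the __main__ guard keeps worker processes from re-running this block
    p = multiprocessing.Pool()   # one worker per CPU core by default
    urls = csv.reader(open('all_links_2017.csv'))
    for value in p.imap_unordered(crawl, [u[0] for u in urls]):
        print(value)
    p.close()
    p.join()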
Upvotes: 1