Reputation: 31
I have written a Python script to open around 1k URLs and process them to get the desired result, but even though multithreading has been introduced it runs slowly, and after some URLs have been processed the whole thing seems to hang; I cannot tell whether it is still running or has stopped. How can I create multiple threads to process them faster? Any help will be highly appreciated. Thanks in advance. Below is my script.
import threading
from multiprocessing.pool import ThreadPool
from selenium import webdriver
from selenium.webdriver.phantomjs.service import Service
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.remote.webdriver import WebDriver as RemoteWebDriver
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count
import csv

def fetch_url(url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    html = driver.page_source
    print(html)
    print("'%s' fetched in %ss" % (url[0], (time.time() - start)))

def thread_task(lock, data_set):
    lock.acquire()
    fetch_url(url)
    lock.release()

if __name__ == "__main__":
    data_set = []
    with open('file.csv', 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in spamreader:
            data_set.append(row)
    lock = threading.Lock()
    # data_set will contain a list of 1k urls
    for url in data_set:
        t1 = threading.Thread(target=thread_task, args=(lock, url))
        # start thread
        t1.start()
        # wait until thread finishes its job
        t1.join()
    print("Elapsed Time: %s" % (time.time() - start))
Upvotes: 1
Views: 1582
Reputation: 77367
You've defeated multithreading twice over: first by waiting for each thread to finish inside the `for url in data_set:` loop before starting the next one, and then by using a lock so that only one instance of the `fetch_url` function can run at a time. You've already imported `ThreadPool`, which is a reasonable tool for the job. Here is how you could use it:
import csv
import time

from multiprocessing.pool import ThreadPool
from selenium import webdriver

def fetch_url(row):
    # Each row from csv.reader is a list; the URL is its first column.
    url = row[0]
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        html = driver.page_source
        print(html)
        print("'%s' fetched in %ss" % (url, (time.time() - start)))
    finally:
        # Always shut the browser down, otherwise orphaned PhantomJS
        # processes pile up and the script appears to hang.
        driver.quit()

if __name__ == "__main__":
    start = time.time()
    with open('file.csv', 'r') as csvfile:
        dataset = list(csv.reader(csvfile, delimiter=' ', quotechar='|'))
    # Guess a thread pool size: a trade-off between number of CPU cores,
    # expected wait time for I/O, and memory size.
    with ThreadPool(20) as pool:
        pool.map(fetch_url, dataset, chunksize=1)
    print("Elapsed Time: %s" % (time.time() - start))
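A further speed-up worth considering (a sketch, not tested against your exact setup): spawning a fresh browser for every URL is expensive. With `threading.local()` each pool worker can create one client and reuse it across all the URLs that worker handles. The `ExpensiveClient` class below is a hypothetical stand-in for `webdriver.PhantomJS()` so the pattern can be shown self-contained:

```python
import threading
from multiprocessing.pool import ThreadPool

class ExpensiveClient:
    """Stand-in for an expensive-to-create resource such as a Selenium driver."""
    instances = 0
    _count_lock = threading.Lock()

    def __init__(self):
        # Count how many clients were actually constructed.
        with ExpensiveClient._count_lock:
            ExpensiveClient.instances += 1

    def fetch(self, url):
        return "html of %s" % url

tls = threading.local()

def fetch_url(url):
    # Create one client per worker thread on first use, then reuse it.
    if not hasattr(tls, "client"):
        tls.client = ExpensiveClient()
    return tls.client.fetch(url)

urls = ["http://example.com/%d" % i for i in range(100)]
with ThreadPool(4) as pool:
    pages = pool.map(fetch_url, urls)

# At most 4 clients were created (one per pool thread) instead of 100.
print(len(pages), ExpensiveClient.instances)
```

With a real driver you would also register cleanup (e.g. `driver.quit()` at the end of each worker's life) so the reused browsers are shut down when the pool closes.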
Upvotes: 2