Reputation: 371
I'm trying to compute four similarity measures (cosine similarity, Jaccard, a Jaccard variant, and the SequenceMatcher ratio) over 800K pairs of documents.
Every document is a txt file of about 100KB to 300KB (about 1500000 characters).
I have two questions about how to make my Python script faster:
MY PYTHON SCRIPTS:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import SequenceMatcher

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    # the documents go to fit/transform, not to the constructor
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def measure_sim(doc1, doc2):
    a, b = doc1.split(), doc2.split()
    c, d = set(a), set(b)
    vectors = [t for t in get_tf_vectors(doc1, doc2)]
    # cosine, Jaccard, Jaccard variant on term counts, SequenceMatcher ratio
    return cosine_similarity(vectors)[1][0], float(len(c & d) / len(c | d)), \
        1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])), \
        SequenceMatcher(None, a, b).ratio()

# items in doc_pair_list look like ('ID', 'doc1_directory', 'doc2_directory')
def data_analysis(doc_pair_list):
    result = {}
    for item in doc_pair_list:
        with open(item[1], 'rb') as f1:
            doc1 = f1.read()
        with open(item[2], 'rb') as f2:
            doc2 = f2.read()
        result[item[0]] = measure_sim(doc1, doc2)
    return result
However, this code uses only 10% of my CPU, and the task would take almost 20 days to finish. So I want to ask whether there is any way to make this code more efficient.
Q1. Since the documents are stored on an HDD, I assume loading the text takes some time, and I suspect that loading only two documents per similarity computation is inefficient. I am therefore going to try loading 50 pairs of documents at once and then computing the similarities for each pair. Would that help?
Q2. Most posts about "how to make your code run faster" say that I should use Python modules implemented in C. However, since I'm already using sklearn, which is known to be quite efficient, I wonder whether there is any better way.
Is there any way to make this Python script use more of the computer's resources and run faster?
Upvotes: 0
Views: 754
Reputation: 624
There may be better solutions, but if computing the similarities is the bottleneck, you could try something like this:
1) A separate process reads all the files one by one and puts them into a multiprocessing.Queue.
2) A pool of worker processes computes the similarities and puts the results into a second multiprocessing.Queue.
3) The main process simply takes the results from results_queue and saves them to a dictionary, as you do now.
I don't know your hardware limitations (number and speed of CPU cores, RAM size, disk read speed) and I don't have any samples to test it on.
EDIT: The code described above is provided below. Please try it, check whether it is faster, and let me know. If the main bottleneck is loading the files, we can create more loader processes (e.g. 2 processes, each loading half of the files). If the bottleneck is calculating the similarities, you can create more worker processes (just change worker_count). At the end, 'results' is the dictionary with all the results.
import multiprocessing
import os
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    # the documents go to fit/transform, not to the constructor
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def calculate_similarities(doc_pairs_queue, results_queue):
    """
    Pick docs from doc_pairs_queue and calculate their similarities, save the result to results_queue.
    Repeat infinitely (until process is terminated).
    """
    while True:
        pair = doc_pairs_queue.get()
        pair_id = pair[0]
        doc1 = pair[1]
        doc2 = pair[2]
        a, b = doc1.split(), doc2.split()
        c, d = set(a), set(b)
        vectors = [t for t in get_tf_vectors(doc1, doc2)]
        results_queue.put((pair_id, cosine_similarity(vectors)[1][0], float(len(c & d) / len(c | d)),
                           1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])),
                           SequenceMatcher(None, a, b).ratio()))

def load_files(doc_pair_list, loaded_queue):
    """
    Pre-load files and put them to a queue, so working processes can get them.
    :param doc_pair_list: list of files to be loaded (ID, doc1_path, doc2_path)
    :param loaded_queue: multiprocessing.Queue that will hold pre-loaded data
    """
    print("Started loading files...")
    for item in doc_pair_list:
        with open(item[1], 'rb') as f1:
            with open(item[2], 'rb') as f2:
                # if queue is full, put() automatically waits until there is space
                loaded_queue.put((item[0], f1.read(), f2.read()))
    print("Finished loading files.")
def data_analysis(doc_pair_list):
    # create a loader process that will pre-load files (it does no calculations, so it loads much faster)
    # loader puts loaded files to a queue; 1 pair ~ 500 KB, 1000 pairs ~ 500 MB max size of queue (RAM memory)
    loaded_pairs_queue = multiprocessing.Queue(maxsize=1000)
    loader = multiprocessing.Process(target=load_files, args=(doc_pair_list, loaded_pairs_queue))
    loader.start()

    # create worker processes - these will do all calculations
    results_queue = multiprocessing.Queue(maxsize=1000)  # workers put results to this queue
    worker_count = os.cpu_count() if os.cpu_count() else 2  # number of worker processes
    workers = []  # create list of workers, so we can terminate them later
    for i in range(worker_count):
        worker = multiprocessing.Process(target=calculate_similarities, args=(loaded_pairs_queue, results_queue))
        worker.start()
        workers.append(worker)

    # main process just picks the results from queue and saves them to the dictionary
    results = {}
    i = 0  # results counter
    pairs_count = len(doc_pair_list)
    while i < pairs_count:
        # Queue.get() is blocking - if queue is empty, get() waits until something is put into queue and then gets it
        # timeout is just in case something unexpected happened (results are calculated much quicker)
        res = results_queue.get(timeout=600)
        results[res[0]] = res[1:]  # save to dictionary by ID (first item in the result)
        i += 1  # count the received result, otherwise the loop never ends

    # clean up the processes (so there aren't any zombies left)
    loader.terminate()
    loader.join()
    for worker in workers:
        worker.terminate()
        worker.join()
    return results
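A more compact variant of the same idea is multiprocessing.Pool, where each worker reads its own pair of files so that only paths and small result tuples cross process boundaries. The sketch below is an illustration under assumptions, not a drop-in replacement: token_measures, process_pair and analyse_pairs are hypothetical names, the files are assumed to be UTF-8 text, the chunk size is an arbitrary starting value, and term counts come from collections.Counter on whitespace tokens instead of CountVectorizer, so the cosine and count-based values will differ somewhat from the sklearn tokenization.

```python
import math
import multiprocessing
from collections import Counter
from difflib import SequenceMatcher


def token_measures(doc1, doc2):
    """Return (cosine, jaccard, count-based overlap, SequenceMatcher ratio)
    computed on whitespace tokens (assumption: not sklearn's tokenizer)."""
    a, b = doc1.split(), doc2.split()
    ca, cb = Counter(a), Counter(b)
    vocab = ca.keys() | cb.keys()
    if not vocab:
        # arbitrary convention for two empty documents
        return 0.0, 0.0, 0.0, 0.0
    shared = ca.keys() & cb.keys()
    dot = sum(ca[t] * cb[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    cosine = dot / norm if norm else 0.0
    jaccard = len(shared) / len(vocab)
    # 1 - sum|x - y| / sum(x + y) over the term-count vectors
    diff = sum(abs(ca[t] - cb[t]) for t in vocab)
    overlap = 1 - diff / (len(a) + len(b))
    ratio = SequenceMatcher(None, a, b).ratio()
    return cosine, jaccard, overlap, ratio


def process_pair(item):
    """Load one (ID, path1, path2) entry inside the worker and score it."""
    pair_id, path1, path2 = item
    # assumption: files are UTF-8 text; undecodable bytes are replaced
    with open(path1, encoding="utf-8", errors="replace") as f1:
        doc1 = f1.read()
    with open(path2, encoding="utf-8", errors="replace") as f2:
        doc2 = f2.read()
    return pair_id, token_measures(doc1, doc2)


def analyse_pairs(doc_pair_list, processes=None, chunksize=16):
    """Score all pairs with a process pool; results arrive in any order."""
    results = {}
    with multiprocessing.Pool(processes=processes) as pool:
        for pair_id, measures in pool.imap_unordered(
                process_pair, doc_pair_list, chunksize=chunksize):
            results[pair_id] = measures
    return results
```

analyse_pairs(doc_pair_list) should be called from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn new interpreters instead of forking. Whether this beats the queue-based version depends on how well the disk copes with several processes reading at once, so it is worth timing both on a small sample.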
Please let me know about the results; I am quite interested in them and will assist you further if needed ;)
Upvotes: 2
Reputation: 817
The first thing to do is to find the real bottleneck. Profiling with cProfile might confirm your suspicion or shed some more light on your problem.
You should be able to run your code unmodified using cProfile like this:
python -m cProfile -o profiling-results python-file-to-test.py
After that you can analyze the results using pstats like this:
import pstats
stats = pstats.Stats("profiling-results")
stats.sort_stats("tottime")
stats.print_stats(10)
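With a job that takes days, profiling the whole script may be impractical, so the same tools can also be used from inside Python on a small sample of pairs. The following is a minimal sketch under assumptions: profile_call and toy_workload are hypothetical names, and toy_workload is only a stand-in for calling measure_sim on a handful of document pairs.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report) where report is the text of the `top`
    entries sorted by total time spent inside each function.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(top)
    return result, stream.getvalue()


def toy_workload(n):
    # hypothetical stand-in for the real work (e.g. a few measure_sim calls)
    return sum(len(str(i).split()) for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(toy_workload, 10000)
    print(report)
```

If file reading dominates the report, the disk is the limit and more worker processes will not help much; if the vectorizer or SequenceMatcher dominates, parallelizing the computation is the more promising direction.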
More on profiling your code can be found in Marco Bonzanini's blog article "My Python Code is Slow? Tips for Profiling".
Upvotes: 1