Reputation: 704
I have a list of keywords and I want to validate if any of these keywords are inside a file containing more than 100,000 domain names. For faster processing, I want to implement multiprocessing so that each keyword can be validated in parallel.
My code doesn't seem to be working well: the single-process version is much faster. What's wrong? :(
import time
from multiprocessing import Pool

def multiprocessing_func(keyword):
    # File containing more than 100k domain names
    # URL: https://raw.githubusercontent.com/CERT-MZ/projects/master/Domain-squatting/domain-names.txt
    file_domains = open("domain-names.txt", "r")
    for domain in file_domains:
        if keyword in domain:
            print("similar domain identified:", domain)
    # Rewind the file, start from the beginning
    file_domains.seek(0)

if __name__ == '__main__':
    starttime = time.time()
    # Keywords to check
    keywords = ["google", "facebook", "amazon", "microsoft", "netflix"]
    # Create a multiprocessing Pool
    pool = Pool()
    for keyword in keywords:
        print("Checking keyword:", keyword)
        # Without multiprocessing pool
        #multiprocessing_func(keyword)
        # With multiprocessing pool
        pool.map(multiprocessing_func, keyword)
    # Total run time
    print('That took {} seconds'.format(time.time() - starttime))
Upvotes: 0
Views: 82
Reputation: 70725
Think about why this program:
import multiprocessing as mp

def work(keyword):
    print("working on", repr(keyword))

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        pool.map(work, "google")
prints
working on 'g'
working on 'o'
working on 'o'
working on 'g'
working on 'l'
working on 'e'
map() works on a sequence, and a string is a sequence. Instead of sticking the map() call in a loop, you presumably want to invoke it only once, with keywords (the whole list) as its second argument.
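A minimal sketch of that fix, assuming the question's multiprocessing_func is defined in the same module and the pool size is left at the default:

from multiprocessing import Pool

if __name__ == '__main__':
    keywords = ["google", "facebook", "amazon", "microsoft", "netflix"]
    with Pool() as pool:
        # One map() call over the whole list: each worker now receives a
        # whole keyword string instead of a single character.
        pool.map(multiprocessing_func, keywords)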
Upvotes: 2