Reputation: 133
I am trying to write a hash-cracking application that checks every line of one file against every line in the rockyou dictionary. By pre-hashing rockyou I got the time to check one hash down to a few seconds, but that is still not fast enough. This is why I am moving my program to multithreading. However, my threads stop without raising any exceptions.
import threading
import datetime

class ThreadClass(threading.Thread):
    hash_list=0
    def file_len(fname):
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1
    list_len=file_len("list.txt")
    def run(self):
        while ThreadClass.list_len>0:
            ThreadClass.list_len=ThreadClass.list_len-1
            print str(threading.current_thread())+":"+str(ThreadClass.list_len)

for i in range(20):
    try:
        t = ThreadClass()
        t.start()
    except:
        raise
Here is output:
When I run it, after some time only one thread is still reporting. Why?
Thanks for any help.
EDIT: One of the threads raises a KeyError. I don't know what that is.
Upvotes: 1
Views: 2818
Reputation: 69082
Because calculating hashes is a CPU-bound problem, multithreading won't help you in CPython because of the GIL (global interpreter lock).
If anything, you need to use multiprocessing. Using a Pool, your whole code could be reduced to something like:
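import multiprocessing

def calculate(line):
    # ... calculate the hash ...
    return (line, 'calculated_result')

if __name__ == '__main__':
    # the guard is needed on platforms that start workers by re-importing this module (e.g. Windows)
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    with open('input.txt') as inputfile:
        # each line of the file is passed to calculate() in a worker process
        result = pool.map(calculate, inputfile)
    print(result)
    # compare results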
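To make the placeholder in calculate a bit more concrete, here is a minimal Python 3 sketch. It assumes the hashes are unsalted MD5 and that the word list decodes as UTF-8; neither is stated in the question, and the names hash_line and hash_all are made up for illustration.

```python
import hashlib
import multiprocessing


def hash_line(line):
    """Return (word, md5_hex) for one line of the word list.

    MD5 and UTF-8 are assumptions; swap in whatever the target hashes use.
    """
    word = line.rstrip("\n")
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return word, digest


def hash_all(lines, processes=None):
    """Hash every line in a pool of worker processes.

    Returns a dict mapping digest -> word, so checking one hash
    afterwards is a single dictionary lookup.
    """
    # processes=None lets Pool default to the number of CPUs
    with multiprocessing.Pool(processes) as pool:
        pairs = pool.map(hash_line, lines, chunksize=1000)
    return {digest: word for word, digest in pairs}
```

hash_all should be called from under an `if __name__ == '__main__':` guard, for example `lookup = hash_all(open('rockyou.txt'))`, after which `lookup.get(some_hash)` returns the matching word or None.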
As to your problem with the threads: you're concurrently accessing ThreadClass.list_len from multiple threads.
First you access it and compare it to 0. Then you access it again, decrease it and store it back, which is not thread safe. And then you access it once more when you print it. Between any of these operations, another thread could modify the value.
To show this, I've modified your code a little:
import threading
import datetime

lns = []

class ThreadClass(threading.Thread):
    hash_list=0
    list_len= 10000
    def run(self):
        while ThreadClass.list_len>0:
            ThreadClass.list_len=ThreadClass.list_len-1
            ln = ThreadClass.list_len # copy for later use ...
            lns.append(ln)

threads = []
for i in range(20):
    t = ThreadClass()
    t.start()
    threads.append(t)
for t in threads:
    t.join()
print len(lns), len(set(lns)), min(lns)
When I run this 10 times, what I get is (number of values collected, number of distinct values, smallest value):
13473 9999 -1
10000 10000 0
10000 10000 0
12778 10002 -2
10140 10000 0
10000 10000 0
15579 10000 -1
10866 9996 0
10000 10000 0
10164 9999 -1
So sometimes it seems to run OK, but at other times a lot of values are added multiple times, and list_len even manages to become negative.
If you disassemble the run method, you'll see this:
>>> dis.dis(ThreadClass.run)
 11           0 SETUP_LOOP             57 (to 60)
        >>    3 LOAD_GLOBAL             0 (ThreadClass)
              6 LOAD_ATTR               1 (list_len)
              9 LOAD_CONST              1 (0)
             12 COMPARE_OP              4 (>)
             15 POP_JUMP_IF_FALSE      59

 12          18 LOAD_GLOBAL             0 (ThreadClass)
             21 LOAD_ATTR               1 (list_len)
             24 LOAD_CONST              2 (1)
             27 BINARY_SUBTRACT
             28 LOAD_GLOBAL             0 (ThreadClass)
             31 STORE_ATTR              1 (list_len)

 13          34 LOAD_GLOBAL             0 (ThreadClass)
             37 LOAD_ATTR               1 (list_len)
             40 STORE_FAST              1 (ln)

 14          43 LOAD_GLOBAL             2 (lns)
             46 LOAD_ATTR               3 (append)
             49 LOAD_FAST               1 (ln)
             52 CALL_FUNCTION           1
             55 POP_TOP
             56 JUMP_ABSOLUTE           3
        >>   59 POP_BLOCK
        >>   60 LOAD_CONST              0 (None)
             63 RETURN_VALUE
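The listing above comes from Python 2. As a rough sketch of how to check the same thing on Python 3, the snippet below collects the opcode names for a single decrement of a class attribute. Counter, decrement and opnames are illustrative names, and the exact opcodes vary between Python versions (newer releases, for instance, use BINARY_OP where the listing shows BINARY_SUBTRACT), but the attribute is still loaded and stored by separate instructions.

```python
import dis


class Counter:
    value = 10  # stands in for ThreadClass.list_len


def decrement():
    # one source line, but several bytecode instructions
    Counter.value = Counter.value - 1


def opnames(func):
    """Return the list of opcode names that make up func."""
    return [ins.opname for ins in dis.get_instructions(func)]


names = opnames(decrement)
print(names)
```

The load (LOAD_ATTR) appears before the store (STORE_ATTR) with other instructions in between, which is exactly the window in which another thread can change the value.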
Put simply, another thread could run between any two of these instructions and modify the shared value. To safely access a value from multiple threads, you need to synchronize the access.
For example, using threading.Lock, the code could be modified like this:
class ThreadClass(threading.Thread):
    # ...
    lock = threading.Lock()
    def run(self):
        while True:
            with self.lock:
                # code accessing shared variables inside lock
                if ThreadClass.list_len <= 0:
                    return
                ThreadClass.list_len -= 1
                list_len = ThreadClass.list_len # store for later use...
            # not accessing shared state, outside of lock
I'm not entirely sure that this is the cause of your problem, but it may be, especially if you're also reading from an input file in your run method.
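For completeness, here is a self-contained Python 3 sketch of the lock-based version, set up like the earlier experiment so the result can be checked. Worker, run_workers and the seen list are names chosen for this illustration; the counter is hard-coded rather than read from list.txt.

```python
import threading


class Worker(threading.Thread):
    remaining = 0            # shared counter, stands in for list_len
    seen = []                # every value claimed by any thread
    lock = threading.Lock()  # one lock shared by all Worker instances

    def run(self):
        while True:
            with Worker.lock:
                # check, decrement and record as one atomic step
                if Worker.remaining <= 0:
                    return
                Worker.remaining -= 1
                claimed = Worker.remaining
                Worker.seen.append(claimed)
            # per-item work (e.g. hashing line number `claimed`)
            # would go here, outside the lock


def run_workers(total=10000, n_threads=20):
    """Run n_threads workers over `total` items and summarise the result."""
    Worker.remaining = total
    Worker.seen = []
    threads = [Worker() for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(Worker.seen), len(set(Worker.seen)), min(Worker.seen)
```

With the lock in place, `run_workers()` returns `(10000, 10000, 0)` on every run: each value is claimed exactly once and the counter never goes negative.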
Upvotes: 4