Reputation: 150
I'm trying to compare two lists of MD5 hashes and identify matches. One of these lists contains approximately 34,000,000 hashes, and the other could contain up to 1,000,000.
Using Numpy I've been experimenting with the time it takes to conduct the comparison vs a standard Python list, and the performance difference is very impressive. The experimental hash datasets I've been using only contain 1,000,000 hashes each, but when I try to simulate the target dataset of 34,000,000, the Python script returns the following error:
process started - [2017-01-01 11:23:18]
Traceback (most recent call last):
  File "compare_data.py", line 30, in <module>
    compare_array_to_array()
  File "compare_data.py", line 24, in compare_array_to_array
    np_array_01 = read_sample_data_01_array()
  File "compare_data.py", line 9, in read_sample_data_01_array
    np_array_01 = np.array(sample_data_01)
MemoryError
I've had a look at other posts regarding Numpy MemoryErrors, but I'm struggling to understand how the problem is resolved, so I apologize in advance if this question has been asked before.
The full script is as follows:
from datetime import datetime
import numpy as np

def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return(np_array_01)

def read_sample_data_02_array():
    sample_data_02 = []
    with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
        for line in fi:
            sample_data_02.append(line)
    np_array_02 = np.array(sample_data_02)
    return(np_array_02)

def compare_array_to_array():
    np_array_02 = read_sample_data_02_array()
    np_array_01 = read_sample_data_01_array()
    ct = np.sum(np_array_01 == np_array_02)
    print(ct)

print("process started - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
compare_array_to_array()
print("process completed - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
The current workstation that this code is running on is 32-bit, as is Python; that said, Ubuntu can see 8 GB of RAM, although I suspect only 4 GB of this is addressable? The target workstation contains 64 GB of RAM and is a 64-bit system; however, I would like to cater for the lesser system.
Here's an example of the strings contained within the datasets I'm trying to compare:
XDPUXSRCYCYTYEUPQMGXEDPHQTDOCLEK
FPNXWZTALSZULLWLPCMTGTAKIGMCAMFT
NKIOLIINPXBMUUOLYKWWBCIXCIVKWCPO
DPLIXJCFJOKLSZUSRPPBDBAADXEHEDZL
NGIMMXKXZHIQHTCXRPKGWYPUPJMAJAPQ
Many thanks
Upvotes: 2
Views: 2719
Reputation: 140286
With that routine:
def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return(np_array_01)
you're creating a native Python list first, then you create a numpy array from it. At that point, the memory you need is roughly double the final required memory, which may explain why you run out of memory.
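To get a feel for the numbers, here's a rough, hypothetical sketch (not from the original post; exact sizes vary by platform and Python build):

import sys
import numpy as np

h = "XDPUXSRCYCYTYEUPQMGXEDPHQTDOCLEK"   # one 32-character hash from the sample data
print(sys.getsizeof(h))                  # size of a single Python str object (tens of bytes of overhead each)
print(np.array([h]).itemsize)            # 128 bytes per cell for dtype <U32 (4 bytes per character)
# While np.array(sample_data_01) runs, the 34,000,000 str objects and the
# numpy copy have to coexist, so peak memory is roughly double the final array.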
You could read the file directly using numpy.loadtxt, but I suspect it would take the same amount of memory because it computes the data size automatically. So here's my solution: use fromiter and specify the data type as "<U32", so numpy knows to allocate at most 32 characters per cell (it cannot know the maximum size in advance because it consumes an iterator, which is also why this approach saves memory):
def read_sample_data_array(filepath):
    with open(filepath) as f:
        array = np.fromiter((line.strip() for line in f), dtype="<U32")
    return array
(Note the generator expression (line.strip() for line in f), which doesn't build an intermediate list in memory, unlike [line.strip() for line in f] would.)
Then:
np_array_02 = read_sample_data_array("data_set_02.txt")
np_array_01 = read_sample_data_array("data_set_01.txt")
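A minimal sketch of how the two arrays could then be compared, assuming you want a match count: np.in1d handles arrays of different lengths, whereas the element-wise == from the question requires equal shapes.

# Sketch only: count how many hashes in the big array also appear in the small one.
# np.in1d returns a boolean array the length of its first argument.
ct = np.sum(np.in1d(np_array_01, np_array_02))
print(ct)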
However, this still takes too much memory, so let me propose a non-numpy alternative, using only base python:
with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
s = {line.strip() for line in fi}
with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
number_of_matches = sum(1 for line in fi if line.strip() in s)
This uses a set for fast lookup, and sum with a conditional generator expression. That should do the trick and be relatively fast even if the big data set grows further. You just don't need the big dataset to be in memory.
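If the matching hashes themselves are needed rather than just a count, a sketch along the same lines (assuming s from the snippet above): set.intersection() accepts any iterable, so the big file is still streamed line by line and only the matches are kept in memory.

with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
    matches = s.intersection(line.strip() for line in fi)
print(len(matches))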
Upvotes: 1