Simon Ridley

Reputation: 150

Python Numpy Memory Error

I'm trying to compare two lists of MD5 hashes and identify matches. One of these lists contains approximately 34,000,000 hashes, and the other could contain up to 1,000,000.

Using Numpy I've been experimenting with the time it takes to conduct the comparison versus a standard Python list, and the performance difference is very impressive. The experimental hash datasets I've been using only contain 1,000,000 entries each, but when I try to simulate the target dataset of 34,000,000, the Python script returns the following error:

process started - [2017-01-01 11:23:18]
Traceback (most recent call last):
  File "compare_data.py", line 30, in <module>
    compare_array_to_array()
  File "compare_data.py", line 24, in compare_array_to_array
    np_array_01 = read_sample_data_01_array()
  File "compare_data.py", line 9, in read_sample_data_01_array
    np_array_01 = np.array(sample_data_01)
MemoryError

I've had a look at other posts regarding Numpy memory errors, but I'm struggling to understand how the problem is resolved, so I apologize in advance if this question has been asked before.

The full script is as follows:

from datetime import datetime
import numpy as np

def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return(np_array_01)


def read_sample_data_02_array():
    sample_data_02 = []
    with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
        for line in fi:
            sample_data_02.append(line)
    np_array_02 = np.array(sample_data_02)
    return(np_array_02)


def compare_array_to_array():
    np_array_02 = read_sample_data_02_array()
    np_array_01 = read_sample_data_01_array()
    ct = np.sum(np_array_01 == np_array_02)
    print(ct)


print("process started - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))    
compare_array_to_array()
print("process completed - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))

The current workstation that this code runs on is 32-bit, as is Python. That said, Ubuntu can see 8 GB of RAM, although I suspect only 4 GB of this is addressable by the process? The target workstation contains 64 GB of RAM and is a 64-bit system, however I would like to cater for the lesser system.
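For reference, here is a quick check of the interpreter's bitness and a rough size estimate for the big array (a sketch, assuming 32-character hashes as in the examples below):

import sys
import numpy as np

# sys.maxsize is 2**31 - 1 on a 32-bit build, where only about 2-4 GB is addressable
print("64-bit Python" if sys.maxsize > 2**32 else "32-bit Python")

# a '<U32' cell stores 32 unicode characters at 4 bytes each = 128 bytes, so
# 34,000,000 hashes need roughly 34e6 * 128 bytes ~ 4.4 GB in a single array
print(np.empty(1, dtype="<U32").itemsize)   # 128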

Here's an example of the strings contained within the datasets I'm trying to compare:

XDPUXSRCYCYTYEUPQMGXEDPHQTDOCLEK
FPNXWZTALSZULLWLPCMTGTAKIGMCAMFT
NKIOLIINPXBMUUOLYKWWBCIXCIVKWCPO
DPLIXJCFJOKLSZUSRPPBDBAADXEHEDZL
NGIMMXKXZHIQHTCXRPKGWYPUPJMAJAPQ

Many thanks

Upvotes: 2

Views: 2719

Answers (1)

Jean-François Fabre

Reputation: 140286

With that routine:

def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return(np_array_01)

you're creating a native Python list first, then creating a numpy array from it. At some point the memory you need is roughly double the final required memory, which may explain why you run out of memory.
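For a sense of scale, here is a rough back-of-envelope estimate; the constants are approximate and assume a 64-bit CPython build (a 32-bit build uses smaller objects, but has far less address space):

n = 34 * 10**6               # hashes in the big file

str_obj   = 49 + 33          # compact ASCII str: ~49 bytes of overhead + 1 byte per char (32 chars + newline)
list_slot = 8                # one pointer per element in the Python list
u33_cell  = 33 * 4           # a numpy '<U33' cell stores 4 bytes per character

list_total  = n * (str_obj + list_slot)   # ~3.1 GB for the list of strings
numpy_total = n * u33_cell                # ~4.5 GB for the numpy array
print(list_total / 1e9, numpy_total / 1e9)
# both allocations are alive at the same time while np.array(sample_data_01) runs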

You could read the file directly using numpy.loadtxt, but I suspect it would take about the same amount of memory because it computes the data size automatically. So here's my solution: use fromiter and specify the data type as "<U32", so numpy knows to allocate at most 32 characters per cell (it cannot know the maximum size in advance because of the iteration process, which is also why this saves memory):

def read_sample_data_array(filepath):
    with open(filepath) as f:
        array = np.fromiter((line.strip() for line in f), dtype="<U32")
    return array

(Note the generator expression (line.strip() for line in f), which allocates almost no memory, unlike the list comprehension [line.strip() for line in f] would.)

then:

np_array_02 = read_sample_data_array("data_set_02.txt")
np_array_01 = read_sample_data_array("data_set_01.txt")
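As a side note on the generator expression point above, here is a quick way to see the difference; the sizes are indicative, vary across Python builds, and getsizeof of a list does not even count the strings it references:

import sys

with open("data_set_02.txt") as f:
    gen = (line.strip() for line in f)
    print(sys.getsizeof(gen))   # a few hundred bytes, regardless of file size

with open("data_set_02.txt") as f:
    lst = [line.strip() for line in f]
    print(sys.getsizeof(lst))   # megabytes of pointers alone; the strings cost more again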

However, this still takes too much memory, so let me propose a non-numpy alternative using only base Python:

with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
    s = {line.strip() for line in fi}

with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
    number_of_matches = sum(1 for line in fi if line.strip() in s)

  1. read the "small" data set and put it in a set for fast lookups
  2. read the "big" data set line by line (so it is never held in memory in full) and count matches using sum and a conditional generator expression

That should do the trick and stay relatively fast even if the big data set grows larger; you just don't need the big data set to be in memory.
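If you need the matching hashes themselves rather than just a count, the same streaming idea works with set.intersection, which only keeps the matches in memory (a sketch, assuming the same file names as above):

with open("data_set_02.txt", "r") as fi:   # 1,000,000 hashes: small enough to hold in a set
    small = {line.strip() for line in fi}

with open("data_set_01.txt", "r") as fi:   # 34,000,000 hashes: streamed, never fully loaded
    matches = small.intersection(line.strip() for line in fi)

print(len(matches))   # number of distinct hashes present in both files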

Upvotes: 1
