Reputation: 664
I have a file with over 4 million lines that I want to read in, line by line, and perform actions on the first "column" of data. There may be duplicated data, and I want to make sure that I am only performing these actions once, and be able to restart the program and start where it left off.
My solution was to parse the line, store the value into a list, and then "If x not in list" perform actions, and then store that list as a pickle if/when the program dies/stops/is killed.
However, the script segfaults after running for several hours, and I assume it is because I have burned through all my RAM by storing a lot of MD5 sums in the checkedFile list.
Because the file is so large (4 million+ entries) I need something that is fast for lookups. Hoping there is a simple solution. Thanks!
Data looks like this: 6e5f6c90e6bf31c31283afb8dca93da8|mouse.gif|10102017
My current code looks like this:
checkedFile = []

def readline(inFile):
    print("Opening file: %s\n" % inFile)
    try:
        with open(inFile, "r") as inputFile:
            for line in inputFile:  # read in file line by line
                fileHash, garbage = line.split("|", 1)
                if fileHash not in checkedFile:
                    checkedFile.append(fileHash)
                    checkit(fileHash)
Upvotes: 6
Views: 9384
Reputation: 3175
One of the issues I would imagine you are having is the data structure that you have chosen. Since checkedFile is a list, checking whether a hash is in the list has O(N) time complexity. The first suggestion I would make is to use a Python set. A set is backed by a hash table, so membership tests are O(1) on average.
checkedFile = set()

def readline(inFile):
    print("Opening file: %s\n" % inFile)
    with open(inFile, "r") as inputFile:
        for line in inputFile:  # read in file line by line
            fileHash, garbage = line.split("|", 1)
            if fileHash not in checkedFile:
                checkedFile.add(fileHash)
                checkit(fileHash)
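If you want to see that difference concretely, here is a rough micro-benchmark sketch (my illustration, not part of the code above; the fake hashes and timings are just placeholders and will vary by machine):

import timeit

hashes = [format(i, "032x") for i in range(100_000)]  # fake 32-char hex "hashes"
as_list = list(hashes)
as_set = set(hashes)
missing = "f" * 32  # worst case for the list: not present, so every element is scanned

print(timeit.timeit(lambda: missing in as_list, number=100))  # O(N) per lookup
print(timeit.timeit(lambda: missing in as_set, number=100))   # O(1) average per lookup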
Python only really segfaults for a couple of reasons, so you almost certainly ran out of memory. I find it odd that you segfaulted with the list storage, though: an MD5 digest is only 16 raw bytes, and 16 bytes * 4 million is about 64 MB. Even stored as 32-character hex strings with Python's per-object overhead, the whole list should only come to a few hundred MB, which is nowhere near the amount of RAM in a typical computer.
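To sanity-check that arithmetic, here is a quick back-of-the-envelope sketch using sys.getsizeof (an estimate, not a real memory profiler; the sample hash is the one from your data):

import sys

sample = "6e5f6c90e6bf31c31283afb8dca93da8"
per_string = sys.getsizeof(sample)  # object header + characters, typically ~80 bytes in CPython
print(per_string, "bytes per hash string")
# Still only a few hundred MB for 4 million strings (the list adds ~8 bytes per reference).
print(per_string * 4_000_000 / 1024**2, "MB")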
If you go the persistent storage route, I would suggest SQLite. This approach cannot run out of memory, since the data is stored on your hard drive rather than in RAM.
import sqlite3

connection = sqlite3.connect('cache.db')

CREATE_CACHE_TABLE = '''CREATE TABLE IF NOT EXISTS cache (
    hash TEXT PRIMARY KEY
)'''
SELECT_STATEMENT = 'SELECT hash FROM cache WHERE hash = ?'
INSERT_STATEMENT = 'INSERT INTO cache VALUES (?)'

with connection:
    connection.execute(CREATE_CACHE_TABLE)

with open(inFile, "r") as inputFile:
    for line in inputFile:
        fileHash, garbage = line.split("|", 1)
        with connection:
            if not connection.execute(SELECT_STATEMENT, (fileHash,)).fetchone():
                connection.execute(INSERT_STATEMENT, (fileHash,))
                checkit(fileHash)
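A possible simplification of the above (my variation, reusing inFile and checkit from your code) is to let the PRIMARY KEY do the duplicate check with INSERT OR IGNORE, so the SELECT and INSERT collapse into a single statement:

import sqlite3

connection = sqlite3.connect('cache.db')
with connection:
    connection.execute('CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY)')

with open(inFile, "r") as inputFile:
    for line in inputFile:
        fileHash, garbage = line.split("|", 1)
        with connection:
            cursor = connection.execute(
                'INSERT OR IGNORE INTO cache VALUES (?)', (fileHash,))
            if cursor.rowcount:      # 1 if the hash was new, 0 if already seen
                checkit(fileHash)    # a failure here rolls back the insert, so it retries on restart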
As an added bonus, you can also cache the results from checkit so that you can resume the calculation if it stops for some reason. This will also benefit you because a web request has a minimum response time of around 10 ms. You mentioned that the web request returns a JSON response; I have written about this before in sqlite with python how to store json in sqlite. You will need to modify the checkit function to return the JSON data.
import sqlite3
import json

def adapt_json(data):
    return (json.dumps(data, sort_keys=True)).encode()

def convert_json(blob):
    return json.loads(blob.decode())

sqlite3.register_adapter(dict, adapt_json)
sqlite3.register_converter('JSON', convert_json)

connection = sqlite3.connect('cache.db', detect_types=sqlite3.PARSE_DECLTYPES)

CREATE_CACHE_TABLE = '''CREATE TABLE IF NOT EXISTS cache (
    hash TEXT PRIMARY KEY,
    check_results JSON
)'''
SELECT_STATEMENT = 'SELECT hash FROM cache WHERE hash = ?'
INSERT_STATEMENT = 'INSERT INTO cache VALUES (?, ?)'

with connection:
    connection.execute(CREATE_CACHE_TABLE)

with open(inFile, "r") as inputFile:
    for line in inputFile:
        fileHash, garbage = line.split("|", 1)
        with connection:
            if not connection.execute(SELECT_STATEMENT, (fileHash,)).fetchone():
                json_data = checkit(fileHash)
                connection.execute(INSERT_STATEMENT, (fileHash, json_data))
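For completeness, here is a sketch (my addition) of how a cached result could be read back later, assuming the table and JSON converter from the snippet above are already in place; the hash is just the example from your data:

import sqlite3
import json

sqlite3.register_converter('JSON', lambda blob: json.loads(blob.decode()))
connection = sqlite3.connect('cache.db', detect_types=sqlite3.PARSE_DECLTYPES)

row = connection.execute(
    'SELECT check_results FROM cache WHERE hash = ?',
    ('6e5f6c90e6bf31c31283afb8dca93da8',)).fetchone()

if row is not None:
    cached_result = row[0]   # already a dict, thanks to the registered JSON converter
    print(cached_result)
else:
    print('hash not processed yet')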
If you use this final approach, you can run this program in parallel on different chunks of the file and it will still work, and it should scale well if you end up with an even larger file. One way to divide the work is sketched below.
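As a rough illustration (an assumption on my part about how you would split the work), each worker process could take every Nth line, so the 4-million-line file never needs to be pre-split:

def lines_for_worker(path, worker_id, num_workers):
    """Yield only the lines assigned to this worker (worker_id in 0..num_workers-1)."""
    with open(path, "r") as f:
        for i, line in enumerate(f):
            if i % num_workers == worker_id:
                yield line

# e.g. worker 0 of 4:
# for line in lines_for_worker(inFile, 0, 4):
#     fileHash, garbage = line.split("|", 1)
#     ...same SQLite check-and-insert as above...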
Upvotes: 2