user5903421

Reading large amounts of data from an Access database

Looking for advice on how to solve my specific problem (a MemoryError from storing too much data in one variable), as well as general advice on other ways I could approach this.


I have an Access 97 database that I am trying to extract data from. Since I have Access 2013 installed, I cannot open the database without downloading Access 2003. No problem -- I can use pyodbc and the Jet ODBC driver to extract the data with Python.
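For reference, the connection is made with something like this (the file path is just a placeholder, not my real one):

import pyodbc

# Jet ODBC driver for legacy .mdb files; DBQ points at the database file
conn_str = (
    r"DRIVER={Microsoft Access Driver (*.mdb)};"
    r"DBQ=C:\path\to\database.mdb;"
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()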

With that cursor, I wrote this function to first query for all table names, then for all the columns associated with each table:

def get_schema(cursor):
    """
    :param cursor: Cursor object to database
    :return: Dictionary with table name as key and list of columns as value
    """
    db_schema = dict()

    # cursor.tables() yields Row objects, so pull out the
    # table_name field rather than using the whole row
    for tbl_row in cursor.tables():
        tbl = tbl_row.table_name
        if tbl not in db_schema:
            db_schema[tbl] = list()
        column_names = list()
        for col in cursor.columns(table=tbl):
            column_names.append(col.column_name)
        db_schema[tbl].append(tuple(column_names))

    return db_schema

The variable I get back looks something like this:

{'Table 1': [('Column 1-1', 'Column 1-2', 'Column 1-3')],
 'Table 2': [('Column 2-1', 'Column 2-2')]}

I then pass that schema variable to another function to dump the data from every table into a list of tuples:

def get_table_data(cursor, schema):
    for tbl, cols in schema.items():
        # Bracket the table name -- the names contain spaces
        sql = "SELECT * FROM [%s]" % tbl
        cursor.execute(sql)
        col_data = cursor.fetchall()  # pulls every row of the table into memory at once

        for row in col_data:
            cols.append(row)

    return schema

However, when I try to display the returned variable, I get the following:

>>> schema2 = get_table_data(cursor, schema)
>>> schema2
Traceback (most recent call last):
  File "<input>", line 1, in <module>
MemoryError

TL;DR: Is there a way to start storing data in another variable when it gets too big to read? Or a way to increase the memory allocation? Ultimately, I'd like to dump this into a CSV file or something similar -- is there a more direct way to go about this?

Upvotes: 0

Views: 1999

Answers (1)

John Montgomery

Reputation: 9058

You probably want to stream the data out of the database rather than loading it all at once. That way you can write it straight back out without ever holding much of it in memory.

The best approach for this is to use generators.

So rather than modifying the schema variable as you currently are, you instead yield each row as you read it from the database tables:

def get_single_table_data(cursor, tbl):
    '''
    Generator to get all data from one table.
    Does this one row at a time, so we don't load
    too much data in at once
    '''
    sql = "SELECT * FROM [%s]" % tbl  # bracket the name, since the table names contain spaces
    cursor.execute(sql)
    while True:
        row = cursor.fetchone()
        if row is None:
            break
        yield row

def print_all_table_data(cursor, schema):
    for tbl, cols in schema.items():
        print(cols)
        rows = get_single_table_data(cursor, tbl)
        for row in rows:
            print(row)

That's obviously just an example, but it would (in theory) print out every row in all the tables - without having more than a row of data in memory at a time.
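To get from there to the CSV file you mentioned, you can feed the same generator into csv.writer one row at a time, so memory use stays flat. A rough sketch (the one-file-per-table naming is just my assumption):

import csv

def dump_all_tables_to_csv(cursor, schema):
    # Stream each table into its own CSV file; only one row
    # is held in memory at a time
    for tbl, cols in schema.items():
        with open("%s.csv" % tbl, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(cols[0])  # schema values hold one tuple of column names
            for row in get_single_table_data(cursor, tbl):
                writer.writerow(row)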

Upvotes: 4
