ssm

Reputation: 5373

Excessive memory usage while getting data from a Postgres database

I have been fetching data from a Postgres database using Python, and it is taking up a lot of memory, as can be seen below:

[screenshot: memory usage]

The following function is the only one I am running, and it takes up an excessive amount of memory. I am using fetchmany() and fetching the data in small chunks. I have also tried iterating over the cursor directly. However, all of these methods end up with really excessive memory usage. Does anyone have any clue as to why this is happening? Is there anything I need to tune on the Postgres end that could help mitigate this problem?

import logging

import pandas as pd
import psycopg2
from tqdm import tqdm


def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines

    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. To deal with these rows we must first find
    out whether there are places in the database that contain
    data spanning multiple lines.
    '''

    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')

    schema = findTables(dbName)

    results = []
    for t in tqdm(sorted(schema.keys())):

        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
        cur  = conn.cursor()
        cur.execute('select * from %s'%t)
        n = 0
        N = 0
        while True:
            css = cur.fetchmany(1000)
            if css == []: break
            for cs in css:
                N += 1
                if any(['\n' in c for c in cs if type(c)==str]):
                    n += 1
        cur.close()
        conn.close()

        tqdm.write('[%40s] -> [%5d][%10d][%.4e]'%(t, n, N, n/(N+1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows'  : n,
        })


    logger.info('Finished checking for multiple lines')

    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv'%(dbName), index=False)

    return results

Upvotes: 2

Views: 1595

Answers (1)

hurturk

Reputation: 5454

Psycopg2 supports server-side cursors for large queries, as stated in this answer. Here is how to use one, together with the client-side buffer setting:

cur = conn.cursor('cursor-name')
cur.itersize = 10000  # number of records buffered on the client per network round trip

That should reduce the memory footprint.
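
For example, the loop from the question could be rewritten around a named (server-side) cursor. This is only a minimal sketch; the connection string, the cursor name multiline_check, and the table name some_table are placeholders, not names from the original code:

import psycopg2

conn = psycopg2.connect("dbname='mydb' user='postgres' host='localhost'")
cur = conn.cursor('multiline_check')  # naming the cursor makes it server-side
cur.itersize = 10000                  # rows fetched per network round trip

cur.execute('select * from some_table')

n = N = 0
for row in cur:  # iteration pulls itersize rows at a time, not the whole result set
    N += 1
    if any('\n' in c for c in row if isinstance(c, str)):
        n += 1

cur.close()
conn.close()

Because the result set stays on the server, only about itersize rows are held in client memory at any point.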

Upvotes: 2
