Reputation: 5373
I have been fetching data from a Postgres database using Python, and it is taking up a lot of memory, as can be seen below:
The following function is the only function that I am running, and it uses an excessive amount of memory. I am using fetchmany() and fetching the data in small chunks. I have also tried to use the cursor iteratively. However, all of these approaches end up with excessive memory usage. Does anyone have any clue as to why this is happening? Is there anything I need to tune on the Postgres end to mitigate this problem?
import logging
import psycopg2
import pandas as pd
from tqdm import tqdm

def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines.
    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. For dealing with these rows we must first find
    out whether there are places in the database that contain
    data that spans multiple lines.
    '''
    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')

    schema = findTables(dbName)

    results = []
    for t in tqdm(sorted(schema.keys())):

        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
        cur = conn.cursor()
        cur.execute('select * from %s'%t)

        n = 0
        N = 0
        while True:
            css = cur.fetchmany(1000)
            if css == []: break
            for cs in css:
                N += 1
                if any(['\n' in c for c in cs if type(c)==str]):
                    n += 1

        cur.close()
        conn.close()

        tqdm.write('[%40s] -> [%5d][%10d][%.4e]'%(t, n, N, n/(N+1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows'  : n,
        })

    logger.info('Finished checking for multiple lines')

    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv'%dbName, index=False)

    return results
Upvotes: 2
Views: 1595
Reputation: 5454
Psycopg2 supports server-side cursors for large queries, as stated in this answer. Here is how to use one together with the client-side buffer setting:
cur = conn.cursor('cursor-name')  # a named cursor is held on the server
cur.itersize = 10000              # records to buffer on the client per round trip
That should reduce the memory footprint.
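For example, the fetchmany loop from the question could be rewritten to stream rows through a named (server-side) cursor. This is only a minimal sketch: the table name t, the dbName variable, and the newline check are taken from the question, while the cursor name and the itersize value are arbitrary choices:

import psycopg2

conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor('multiline_check')   # named cursor -> result set stays on the server
cur.itersize = 10000                   # rows fetched per network round trip
cur.execute('select * from %s' % t)

n = N = 0
for cs in cur:                         # iteration pulls itersize rows at a time
    N += 1
    if any('\n' in c for c in cs if isinstance(c, str)):
        n += 1

cur.close()
conn.close()

With a named cursor the full result set is never materialised in the client process, which is what keeps the client's memory footprint flat.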
Upvotes: 2