Reputation: 1311
So I have a table with 146 columns and approximately 8 million rows of sparse data stored locally in a PostgreSQL database.
My goal is to select the whole dataset at once, store it into a pandas dataframe and perform some calculations.
So far I have read about server-side cursors in many threads, but I guess I'm doing something wrong as I don't see any improvement in memory usage. The documentation is also quite limited.
My code so far is the following:
import pandas as pd

# conn is an existing psycopg2 connection
cur = conn.cursor('testCursor')  # named cursor, i.e. a server-side cursor
cur.itersize = 100000
cur.execute("select * from events")
df = cur.fetchall()              # fetches every row at once
df = pd.DataFrame(df)
conn.commit()
conn.close()
I also tried using fetchmany() or fetchone() instead of fetchall(), but I don't know how to scroll through the results. I guess I could use something like this for fetchone(), but I don't know how to handle fetchmany():
row = cur.fetchone()
while row:
    # do something with row here
    row = cur.fetchone()
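For fetchmany() I guess it would be something along these lines, although I'm not sure this is the right approach (the batch size of 100000 is just an arbitrary value I picked):

while True:
    rows = cur.fetchmany(100000)   # next batch of rows
    if not rows:
        break
    chunk = pd.DataFrame(rows)     # one dataframe per batch
    # do something with chunk here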
Lastly, in the case of fetchone() and fetchmany(), how can I concatenate the results into a single dataframe without consuming all of my memory? Just to note that I have 16 GB of RAM available.
Upvotes: 0
Views: 1801
Reputation: 12563
8 million rows x 146 columns (assuming each column stores at least one byte) already gives you at least 1 GB. Since your columns most likely store more than one byte each, even if the first step of what you are trying to do succeeded, you would hit RAM constraints (i.e. the end result won't fit in RAM).
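To put a rough number on it: if pandas ends up holding most of those 146 columns as 8-byte float64 values (an assumption, since the schema isn't posted), that is 8,000,000 rows x 146 columns x 8 bytes ≈ 9.3 GB for the dataframe alone, before any overhead for string/object columns. A single in-memory dataframe is therefore already close to your 16 GB limit.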
The usual strategy for processing large datasets is to process them in small batches and then, if needed, combine the results. Have a look at PySpark, for example.
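Before reaching for Spark, the same idea works with the tools you are already using. Here is a minimal sketch, assuming a psycopg2 server-side (named) cursor, a placeholder connection string, and that your calculations can be expressed as per-batch partial results (sums in this example) that are combined at the end:

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=yourdb")  # placeholder, use your own connection details
cur = conn.cursor('batchCursor')          # named cursor = server-side cursor
cur.execute("select * from events")

partial = None
while True:
    rows = cur.fetchmany(100000)          # only this batch is held in client memory
    if not rows:
        break
    chunk = pd.DataFrame(rows)
    batch_result = chunk.sum(numeric_only=True)   # example per-batch calculation
    partial = batch_result if partial is None else partial.add(batch_result, fill_value=0)

cur.close()
conn.close()

The key point is that what you keep around (partial here) stays small, while each batch dataframe is discarded after it has been processed.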
Upvotes: 1