Reputation: 51
I have a rather large table in PostgreSQL. When executing
select pg_size_pretty(pg_total_relation_size('my_schema.my_table'));
in PostgreSQL, I get a table size of 2048 MB. My PC has 16 GB of RAM and an AMD Ryzen 7 PRO 4750G CPU, and runs Ubuntu 20.04.
I establish a connection from Python to PostgreSQL by using the module psycopg2 and retrieve the data by using Pandas. It's a simple piece of code:
db_conn = psycopg2.connect(host = 'localhost', database = 'my_db', user = 'user_name', password = 'my_password')
stmt = "select order_name, order_timestamp::date, col1, col12 from my_schema.my_table;"
data_df = pd.io.sql.read_sql(stmt, db_conn)
In the beginning, my RAM usage lies at around 2.5 GB. But then, when I try to retrieve the data with the last statement, it starts increasing, eventually reaching 16 GB, and then my Python process is killed with the message "Killed".
Can anyone explain why that is? I had around 13.5 GB of free RAM, the table to be read was around 2 GB, and yet my RAM usage eventually went to 100% and the execution was aborted.
I also tried data_df = pd.read_sql(stmt, db_conn)
for reading the table (I'm not sure what the difference is). That had the same result.
Eventually, after some googling, I found an alternative that goes through a temporary file, where essentially the last line from above is substituted by:
import tempfile

with tempfile.TemporaryFile() as tmpfile:
    copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(query = stmt, head = "HEADER")
    cur = db_conn.cursor()
    cur.copy_expert(copy_sql, tmpfile)
    tmpfile.seek(0)
    data_df = pd.read_csv(tmpfile)
That works like a charm, but I don't understand why. The data frame data_df is still a bit larger than expected (2.5 GB), but much smaller than before.
Any ideas why the second method works and the first fails?
What is the purpose of tmpfile.seek(0), and where exactly is the temporary file stored, and under what name? It seems to me that PostgreSQL creates it, but no file name (ending in '.csv') is ever assigned, just the Python variable (here tmpfile). That example really bugs me: I don't understand what happens or why the code works, so hopefully someone can shed some light on it for me.
Upvotes: 4
Views: 1653
Reputation: 25220
Any ideas why the second method works and the first fails?
My best guess is that the Postgres database driver represents the table in an inefficient intermediate format (a Python object per value), and that it loads the entire table into memory before pandas converts that intermediate representation into numpy arrays. I would guess that you're running out of memory at this step.
To check this theory, you can try passing chunksize to read the table in smaller chunks and concat() them all together.
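A minimal sketch of that check, reusing stmt and db_conn from the question (the chunk size of 100,000 rows is an arbitrary choice):

import pandas as pd

# With chunksize, read_sql returns an iterator of DataFrames instead of
# materializing the whole result as one frame.
chunks = pd.read_sql(stmt, db_conn, chunksize=100_000)
data_df = pd.concat(chunks, ignore_index=True)

One caveat: whether this actually caps the driver's memory use depends on the cursor. psycopg2's default client-side cursor still fetches the full result set when the query executes, so a server-side (named) cursor may be needed for the chunks to truly stream.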
It seems to me that PostgreSQL creates it
No, Python is creating it, and doing all reads/writes to the file. See the tempfile module.
where exactly is the temporary file stored and under what name?
The temporary file is usually stored in /tmp, and it doesn't have a name. Python creates the file, opens it, then deletes it. On Linux, if you delete a file, it is not really deleted until all file descriptors to it are closed. Therefore, if you want a file which is automatically deleted when the program exits, no matter what, deleting the file right after opening it is an effective method.
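Roughly, the trick looks like this (a simplified sketch of what TemporaryFile does on POSIX; the real implementation has more platform-specific paths, e.g. O_TMPFILE on Linux):

import os
import tempfile

fd, path = tempfile.mkstemp()    # create a file with a real path in /tmp
os.unlink(path)                  # remove its directory entry right away
with os.fdopen(fd, 'w+b') as f:  # the open descriptor keeps the data alive
    f.write(b'still readable')
    f.seek(0)
    print(f.read())              # b'still readable'
# once the descriptor is closed, the storage is reclaimed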
If you want the file to have a name, you need to use tempfile.NamedTemporaryFile. If you use a named temporary file, you can print its name like this:
with tempfile.NamedTemporaryFile() as f:
    print(f.name)
What is the purpose of tmpfile.seek(0)?
When you read or write a file, you have a "position" within that file, and reading or writing advances that position. After Postgres writes the contents of the table into the file, the position is at the end. You want to read from the beginning, so you seek to zero. (The number is relative to the beginning of the file.) Documentation.
You don't normally see this in Python code because you're usually just reading or just writing to the file, not both.
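If you want to see the position in action, here is a tiny self-contained demonstration (not from the original post):

import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b'hello')  # the position is now 5, the end of the file
    print(f.read())    # b'' -- reading from the end yields nothing
    f.seek(0)          # move the position back to the start
    print(f.read())    # b'hello'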
As a closing remark, thank you for posting this question. I've never seen that technique before for dealing with large tables, so thanks for teaching me something!
Upvotes: 1