Reputation: 51
I have a rather large table in PostgreSQL. When executing
select pg_size_pretty(pg_total_relation_size('my_schema.my_table'));
in PostgreSQL, I get a table size of 2048 MB. My PC has 16 GB of RAM and an AMD Ryzen 7 PRO 4750G CPU, and runs Ubuntu 20.04.
I establish a connection from Python to PostgreSQL by using the module psycopg2 and retrieve the data by using Pandas. It's a simple piece of code:
db_conn = psycopg2.connect(host = 'localhost', database = 'my_db', user = 'user_name', password = 'my_password')
stmt = "select order_name, order_timestamp::date, col1, col12 from my_schema.my_table;"
data_df = pd.io.sql.read_sql(stmt, db_conn)
In the beginning, my RAM usage lies at around 2.5 GB. But then, when I try to retrieve the data with the last statement, it starts increasing, eventually reaching 16 GB, and then my Python process is killed with the message "Killed".
Can anyone explain why that is? I had around 13.5 GB of free RAM, the table to be read was around 2 GB, and yet my RAM usage eventually went to 100% and the execution was aborted.
I also tried data_df = pd.read_sql(stmt, db_conn)
for reading the table (I'm not sure what the difference is). That had the same result.
Eventually, after some googling, I found an alternative that goes through a temporary file, where essentially the last line from above is substituted by:
import tempfile

with tempfile.TemporaryFile() as tmpfile:
    copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(query = stmt, head = "HEADER")
    cur = db_conn.cursor()
    cur.copy_expert(copy_sql, tmpfile)
    tmpfile.seek(0)
    data_df = pd.read_csv(tmpfile)
That works like a charm, but I don't understand why. The data frame data_df is still a bit larger than expected (2.5 GB), but much smaller than before.
Any ideas why the second method works and the first fails?
What is the purpose of tmpfile.seek(0), and where exactly is the temporary file stored, and under what name? It seems to me that PostgreSQL creates it, but no file name (ending in '.csv') is ever assigned, just the Python variable (here tmpfile). That example really bugs me: I don't understand what happens or why the code works, so hopefully someone can shed some light on it for me.
Upvotes: 4
Views: 1653
Reputation: 25220
Any ideas why the second method works and the first fails?
My best guess is that the Postgres database driver represents the table in an inefficient intermediate format (a Python object per value), and that it loads the entire table into memory before pandas converts that intermediate representation into numpy arrays. I would guess that you're running out of memory at this step.
To check this theory, you can try passing chunksize to read the table in smaller chunks and concat() them all together.
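A minimal sketch of that check, reusing stmt and db_conn from the question (the chunk size of 100,000 rows is an arbitrary choice):

import pandas as pd

# With chunksize, read_sql returns an iterator of DataFrames instead of
# materializing the whole result as one frame.
chunks = pd.read_sql(stmt, db_conn, chunksize=100_000)
data_df = pd.concat(chunks, ignore_index=True)

One caveat: whether this actually caps the driver's memory use depends on the cursor. psycopg2's default client-side cursor still fetches the full result set when the query executes, so a server-side (named) cursor may be needed for the chunks to truly stream.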
It seems to me that PostgreSQL creates it
No, Python is creating it, and doing all reads/writes to the file. See the tempfile module.
where exactly is the temporary file stored and under what name?
The temporary file is usually stored in /tmp, and it doesn't have a name. Python creates the file, opens it, then deletes it. On Linux, if you delete a file, it is not really deleted until all file descriptors to it are closed. Therefore, if you want a file which is automatically deleted when the program exits, no matter what, deleting the file right after opening it is an effective method.
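Roughly, the trick looks like this (a simplified sketch of what TemporaryFile does on POSIX; the real implementation has more platform-specific paths, e.g. O_TMPFILE on Linux):

import os
import tempfile

fd, path = tempfile.mkstemp()    # create a file with a real path in /tmp
os.unlink(path)                  # remove its directory entry right away
with os.fdopen(fd, 'w+b') as f:  # the open descriptor keeps the data alive
    f.write(b'still readable')
    f.seek(0)
    print(f.read())              # b'still readable'
# once the descriptor is closed, the storage is reclaimed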
If you want the file to have a name, you need to use tempfile.NamedTemporaryFile. If you use a named temporary file, you can print its name like this:
with tempfile.NamedTemporaryFile() as f:
    print(f.name)
What is the purpose of tmpfile.seek(0)?
When you read or write a file, you have a "position" within that file, and reading or writing advances that position. After Postgres writes the contents of the table into the file, the position is at the end. You want to read from the beginning, so you seek to zero. (The number is relative to the beginning of the file.) Documentation.
You don't normally see this in Python code because you're usually just reading or just writing to the file, not both.
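If you want to see the position in action, here is a tiny self-contained demonstration (not from the original post):

import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b'hello')  # the position is now 5, the end of the file
    print(f.read())    # b'' -- reading from the end yields nothing
    f.seek(0)          # move the position back to the start
    print(f.read())    # b'hello'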
As a closing remark, thank you for posting this question. I've never seen that technique before for dealing with large tables, so thanks for teaching me something!
Upvotes: 1