Reputation: 150
For whatever reason, I can easily read data from a postgres database using the pandas read_sql method, but even with exactly the same connection parameters, df.to_sql doesn't work. The operation seems to start, but it never finishes. (I've tried waiting for about 10 minutes.) The df itself is only about 50k rows and 20 columns. Here's what I've been doing:
from sqlalchemy import create_engine
from getpass import getpass
p = getpass("Input password")
in_string = 'postgresql://<USER>:{}@<SERVER>:5432/<DB>'.format(p)
engine = create_engine(in_string)
df.reset_index().to_sql('<TABLE_NAME>', engine, schema='<SCHEMA>', index=False, chunksize=500)
Another interesting tidbit: a new table does get created in the database, but it's never populated with any data whatsoever. The column names and datatypes all seem to be correct, though.
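For reference, a quick way to watch for incoming rows from a second session while to_sql runs (purely diagnostic; the placeholders and in_string mirror the ones above):
# Purely diagnostic: count rows in the target table from a second session
# while to_sql is running. Rebuild in_string as above.
from sqlalchemy import create_engine, text

check_engine = create_engine(in_string)
with check_engine.connect() as conn:
    print(conn.execute(text('SELECT count(*) FROM <SCHEMA>.<TABLE_NAME>')).scalar())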
I've tried changing chunksizes, reindexing, and just waiting. It's all the more frustrating because there's no error. Has anyone else encountered this problem with pandas to_sql? Any ideas what could be causing it?
EDIT:
Minutes after posting this, it finished. In total, I think it took about 20 minutes, and the index was passed as a column despite the index=False argument. The question remains, though: I suspect this is slow because, under the hood, pandas inserts every single row individually. Are there any faster alternatives I could use from within Python?
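One way to check the row-by-row suspicion is SQLAlchemy's echo flag, which logs every statement the engine emits (diagnostic only, not a fix):
# echo=True makes SQLAlchemy log each statement it sends, so you can see
# whether to_sql issues a separate INSERT per row.
engine = create_engine(in_string, echo=True)
df.reset_index().to_sql('<TABLE_NAME>', engine, schema='<SCHEMA>', index=False, chunksize=500)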
EDIT2: I should have mentioned this earlier. I'm trying to write a script that iterates through a list of csv files (about 20) and "copies" them into non-existent tables in a psql database. The nice thing about pandas is that it infers the dtypes in an intelligent way, translates them into sql's (to me, at least) arcane data types, creates the tables, and then inserts the data.
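Roughly, the loop I have in mind looks like this (csv_dir is illustrative, the schema placeholder mirrors the one above, and if_exists='fail' just makes to_sql raise if a table somehow already exists):
import os
import pandas as pd

csv_dir = '/path/to/csvs'  # illustrative; wherever the ~20 csv files live
for fname in os.listdir(csv_dir):
    if not fname.endswith('.csv'):
        continue
    table_name = os.path.splitext(fname)[0]
    df = pd.read_csv(os.path.join(csv_dir, fname))
    # pandas infers dtypes, creates the table, then inserts the rows
    df.to_sql(table_name, engine, schema='<SCHEMA>', index=False,
              if_exists='fail', chunksize=500)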
Upvotes: 2
Views: 1758
Reputation: 150
After spending a lot of time first trying to write a bash script that uses csvkit to determine data types for a psql CREATE TABLE command and then copies the data in, I decided the best solution is still to go the pandas to_sql route.
There's an existing issue here detailing the problems with how pandas performs inserts one row at a time.
From the link above, I applied the following simple monkey patch (thanks to user nhockam) before calling to_sql. This seems to speed things up substantially.
from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    # build one multi-row INSERT instead of one statement per row
    print("Using monkey-patched _execute_insert")
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

SQLTable._execute_insert = _execute_insert
Although I didn't time it myself, it looks like other users have on the issue page. For now, this seems like a reasonably good solution for batch importing csv files with Python via pandas.
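As a side note, newer pandas versions expose a method argument on to_sql that requests a similar multi-row insert without any patching; I haven't benchmarked it against this workload, so treat it as an untested alternative:
# On newer pandas versions (where to_sql accepts a method argument), this asks
# for multi-row INSERTs directly instead of monkey patching. Untested here.
df.to_sql('<TABLE_NAME>', engine, schema='<SCHEMA>', index=False,
          chunksize=500, method='multi')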
Upvotes: 1