Reputation: 1667
working in postgresql I have a cartesian join producing ~4 million rows. The join takes ~5sec and the write back to the DB takes ~1min 45sec.
The data will be required for use in python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in python. I should say here that all these tests are running on one machine, so nothing is going across a network.
Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here:cartesian product in pandas) takes consistently under 3 secs, impressive.
Writing the data back to a table in the database however takes anything from 8 minutes (best method) to 36+minutes (plus some methods I rejected as I had to stop them after >1hr).
While I was not expecting to reproduce the "sql only" time, I would hope to be able to get closer than 8 minutes (I`d have thought 3-5 mins would not be unreasonable).
Slower methods include:
36min - sqlalchemy`s table.insert (from 'test_sqlalchemy_core' here https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)
13min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)
13-15min (depends on chunksize) - pandas.dataframe.to_sql (again using sqlalchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
Best way (~8min) is using psycopg2`s cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541). This involves dumping the data to a csv first (in memory via io.StringIO), that alone takes 2 mins.
So, my questions:
Anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to postgresql?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to csv. Is there some way to add these methods?
Thanks. Giles
On Q2 - pandas can now use a custom callable for to_sql and the given example here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (IE it copies csv data directly from STDIN using StringIO). I found an ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.
Upvotes: 3
Views: 2206
Reputation: 1667
Answering Q 1 myself: It seems the issue had more to do with Postgresql (or rather Databases in general). Taking into account points made in this article:https://use-the-index-luke.com/sql/dml/insert I found the following:
1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in postgresql) took a further 12 seconds, so still well under the other times.
2) With only a primary key in place, Inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense as there should be little or no shuffling of the index rows required. I also verified that this is the reason why my cartesian join in postgresql was faster in the first place (IE the rows were ordered by the index, purely by chance), placing the same rows in a temporary table (unordered) and inserting from that actually took allot longer.
3) I tried similar experiments on our mysql systems and found the same increase in insert speed when removing indexes. With mysql however it seemed that rebuilding the indexes used up any time gained.
I hope this helps anyone else who comes across this question from a search.
I still wonder if it is possible to remove the write to csv step in python (Q2 above) as I believe I could then write something in python that would be faster than pure postgresql.
Thanks, Giles
Upvotes: 1