ResrieC

Reputation: 583

SQLAlchemy Core bulk insert slow

I'm trying to truncate a table and insert only ~3000 rows of data using SQLAlchemy, and it's very slow (~10 minutes).

I followed the recommendations in this doc and used SQLAlchemy Core to do my inserts, but it's still running very slowly. What are possible culprits for me to look at? The database is a Postgres RDS instance. Thanks!

engine = sa.create_engine(db_string, **kwargs, pool_recycle=3600)
with engine.begin() as conn:
    conn.execute("TRUNCATE my_table")
    conn.execute(
        MyTable.__table__.insert(),
        data  # where data is a list of dicts
    )

Upvotes: 2

Views: 8443

Answers (2)

Alex Sidorov

Reputation: 99

Some time ago I struggled with this problem at my company, so we created a library with functions for bulk insert and update. I hope we've taken all the performance and security concerns into account. The library is open source and available on PyPI under the name bulky.

Let me show you some examples of usage:

insert:

import bulky
from random import random  # used for the example values below

from your.sqlalchemy.models import Model
from your.sqlalchemy.session import Session

data = [
    {Model.column_float: random()}
    for _ in range(100_000_000)
]

rows_inserted = bulky.insert(
    session=Session,
    table_or_model=Model,
    values_series=data,
    returning=[Model.id, Model.column_float]
)

new_items = {row.id: row.column_float for row in rows_inserted}

update:

import bulky
from your.sqlalchemy.models import ManyToManyTable
from your.sqlalchemy.session import Session

data = [
    {
        ManyToManyTable.fk1: i,
        ManyToManyTable.fk2: j,
        ManyToManyTable.value: i + j,
    }
    for i in range(100_000_000)
    for j in range(100_000_000)
]

rows_updated = bulky.update(
    session=Session,
    table_or_model=ManyToManyTable,
    values_series=data,
    returning=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
        ManyToManyTable.value,
    ],
    reference=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
    ],
)

updated_items = {(row.fk1, row.fk2): row.value for row in rows_updated}

Not sure if links are allowed, so I'll put them under a spoiler:

Readme and PyPI

Upvotes: 3

dizzyf

Reputation: 3693

I was bummed when I saw this didn't have an answer... I ran into the exact same problem the other day: trying to bulk-insert millions of rows into a Postgres RDS instance using Core. It was taking hours.

As a workaround, I ended up writing my own bulk-insert script that generated the raw SQL itself:

# Build one big multi-row VALUES clause as a raw SQL string
bulk_insert_str = []
for entry in entry_list:
    val_str = "('{}', '{}', ...)".format(entry["column1"], entry["column2"], ...)
    bulk_insert_str.append(val_str)

# Send a single INSERT statement covering every row
engine.execute(
    """
    INSERT INTO my_table (column1, column2, ...)
    VALUES {}
    """.format(",".join(bulk_insert_str))
)

While ugly, this gave me the performance we needed (~500,000 rows/minute).
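
For anyone copying this pattern: building SQL by string formatting breaks on values that contain quotes and is open to SQL injection. A safer way to get the same single multi-row INSERT is psycopg2's execute_values helper, which fills in the VALUES list with bound parameters. The following is only a minimal sketch under the same assumptions as the snippet above (an entry_list of dicts and hypothetical columns column1/column2); dsn is a placeholder for your RDS connection string:

import psycopg2
from psycopg2.extras import execute_values

# entry_list, my_table, column1/column2, and dsn are placeholders
# matching the hand-rolled snippet above.
rows = [(entry["column1"], entry["column2"]) for entry in entry_list]

with psycopg2.connect(dsn) as conn:
    with conn.cursor() as cur:
        # execute_values expands the single %s into a multi-row VALUES
        # list, sending page_size rows per statement.
        execute_values(
            cur,
            "INSERT INTO my_table (column1, column2) VALUES %s",
            rows,
            page_size=1000,
        )

The effect is the same one-statement insert, but the driver handles the quoting and escaping.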

Did you find a CORE-based solution? If not, hope this helps!

UPDATE: Ended up moving my old script onto a spare EC2 instance that we weren't using, which actually fixed the slow performance. Not sure what your setup is, but apparently there's network overhead when communicating with RDS from an external (non-AWS) connection.

Upvotes: 8
