Reputation: 3801
I have a data-analysis script that I am putting together. The script connects to Teradata, runs a Select * against a table, and loads the result into a pandas dataframe.
import teradata
import pandas as pd
# set up the UdaExec framework used to create the connection
udaExec = teradata.UdaExec(appName="xxx", version="1.0", logConsole=False)
with udaExec.connect(method="xxx", dsn="xxx", username="xxx", password="xxx") as session:
    query = "Select * from TableA"
    # read in records
    df = pd.read_sql(query, session)
    # misc pandas tests below...
# misc pandas tests below...
This works great for tables with 100k records or fewer, but many tables have far more records than that (millions and millions), and for those the script just runs indefinitely.
Is there some intermediate step I can take? I've been researching and I see something about copying the DB table to a .csv or .txt file first and then loading the pandas dataframe from that (instead of loading from the table itself), but I can't make sense of it.
Any advice would be appreciated! Thanks.
Upvotes: 4
Views: 8449
Reputation: 18625
In a comment I promised to provide some code that can read a table from a server quickly into a local CSV file, then read that CSV file into a Pandas dataframe. Note that this code is written for PostgreSQL, but you could probably adapt it pretty easily for other databases.
Here is the code:
from io import StringIO  # on Python 2, use: from cStringIO import StringIO
import psycopg2
import psycopg2.sql as sql
import pandas as pd
database = 'my_db'
pg_host = 'my_postgres_server'
table = 'my_table'
# note: you should also create a ~/.pgpass file with the credentials needed to access
# this server, e.g., a line like "*:*:*:username:password" (if you only access one server)
con = psycopg2.connect(database=database, host=pg_host)
cur = con.cursor()
# Copy data from the database to a dataframe, using psycopg2 .copy_expert() function.
csv = StringIO() # or tempfile.SpooledTemporaryFile()
# The next line is the right way to insert a table name into a query, but it requires
# psycopg2 >= 2.7. See here for more details: https://stackoverflow.com/q/13793399/3830997
copy_query = sql.SQL("COPY {} TO STDOUT WITH CSV HEADER").format(sql.Identifier(table))
cur.copy_expert(copy_query, csv)
csv.seek(0) # move back to start of csv data
df = pd.read_csv(csv)
Here also is some code that writes large dataframes to the database via the CSV route:
csv = StringIO()
df.to_csv(csv, index=False, header=False)
csv.seek(0)
try:
    cur.copy_from(csv, table, sep=',', null='\\N', size=8192, columns=list(df.columns))
    con.commit()
except:
    con.rollback()
    raise
I tested this code over my 10 Mbps office network (don't ask!) with a 70,000 row table (5.3 MB as a CSV).
When reading a table from the database, I found that the code above was about 1/3 faster than pandas.read_sql() (5.5s vs. 8s). I'm not sure that would justify the extra complexity in most cases. This is probably about as fast as you can get -- PostgreSQL's COPY TO ... command is very fast, and so is Pandas' read_csv.
When writing a dataframe to the database, I found that using a CSV file (the code above) was about 50x faster than using pandas' df.to_sql() (5.8s vs 288s). This is mainly because Pandas doesn't use multi-row inserts. This seems to have been a subject of active discussion for several years -- see https://github.com/pandas-dev/pandas/issues/8953.
A couple of notes about chunksize: this may not do what most users expect. If you set chunksize in pandas.read_sql(), the query still runs as one command, but the results are returned to your program in batches; this is done with an iterator that yields each chunk in turn. If you use chunksize in pandas.to_sql(), it causes the inserts to be done in batches, reducing memory requirements. However, at least on my system, each batch is still broken down into individual insert statements for each row, and those take a long time to run.
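For example, here is a minimal sketch of the read_sql() iterator pattern (the connection object con, the table name my_table, and the chunk size are placeholders):
import pandas as pd
# with chunksize set, read_sql() returns an iterator that yields one
# DataFrame of up to 50,000 rows per iteration
chunks = pd.read_sql("SELECT * FROM my_table", con, chunksize=50000)
pieces = []
for chunk in chunks:
    # process or filter each chunk here and keep only what you need
    pieces.append(chunk)
df = pd.concat(pieces, ignore_index=True)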
UPDATE: Pandas added a method argument to .to_sql() soon after I wrote this, which may do exactly what you want. Check the documentation for the 'multi' and callable options.
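As a rough sketch of using method='multi' (the SQLAlchemy engine URL and table name below are placeholders, and df is the dataframe from the code above):
import pandas as pd
from sqlalchemy import create_engine
# placeholder connection string -- replace with your own server/database
engine = create_engine("postgresql://user:password@my_postgres_server/my_db")
# method='multi' packs many rows into each INSERT statement; chunksize limits
# how many rows go into each statement
df.to_sql("my_table", engine, if_exists="append", index=False,
          method="multi", chunksize=1000)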
Upvotes: 5
Reputation: 95
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html
It appears that the .read_sql() method has a chunksize argument. Have you tried something like df = pd.read_sql(query, session, chunksize=100000)? (I'm using a chunksize of 100k since you said 100k records weren't a problem.)
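Note that with chunksize set, read_sql() returns an iterator of DataFrames rather than a single DataFrame, so as a sketch you would still need to consume the chunks, for example:
chunks = pd.read_sql(query, session, chunksize=100000)
df = pd.concat(chunks, ignore_index=True)  # or process each chunk inside a loop instead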
Upvotes: 1