Leonard Aukea

Reputation: 422

Redshift to dask DataFrame

Does anyone have a nice, neat, and stable way to achieve the equivalent of:

pandas.read_sql(sql, con, chunksize=None)

and/or

pandas.read_sql_table(table_name, con, schema=None, chunksize=None)

connected to Redshift with SQLAlchemy & psycopg2, reading directly into a dask DataFrame?

The solution should be able to handle large amounts of data.
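For reference, the chunked pandas pattern I'd like to parallelize looks roughly like this (the connection string, schema, and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Redshift speaks the PostgreSQL wire protocol, so the psycopg2 dialect applies
engine = create_engine(
    "postgresql+psycopg2://user:password@example-cluster.redshift.amazonaws.com:5439/mydb"
)

# read_sql with chunksize returns an iterator of DataFrames, one per chunk of rows
for chunk in pd.read_sql("SELECT * FROM my_schema.my_table", engine, chunksize=100_000):
    ...  # process each chunk; this is the part I'd like dask to handle in parallel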

Upvotes: 4

Views: 1972

Answers (1)

MRocklin

Reputation: 57251

You might consider the read_sql_table function in dask.dataframe.

http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table

>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
...                        npartitions=10, index_col='id')

This relies on the pandas.read_sql_table function internally, so it should operate under the same restrictions, except that you're now asked to provide a number of partitions and an index column.
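Applied to the Redshift case in the question, a sketch might look like the following (the connection string, schema, table, and column names are placeholder assumptions; Redshift speaks the PostgreSQL wire protocol, so SQLAlchemy's psycopg2 dialect works here):

import dask.dataframe as dd

# Hypothetical Redshift connection string, using the psycopg2 driver
con = "postgresql+psycopg2://user:password@example-cluster.redshift.amazonaws.com:5439/mydb"

df = dd.read_sql_table(
    "my_table",              # assumed table name
    con,
    schema="my_schema",      # optional; defaults to the connection's search path
    index_col="id",          # assumed numeric, roughly uniformly distributed column
    npartitions=10,          # number of parallel queries / resulting partitions
)

dask uses index_col to split the table into non-overlapping ranges and issues one query per partition, so a numeric, indexed column with a fairly uniform distribution tends to give the most even partitions.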

Upvotes: 1
