SolverWorld

Reputation: 675

Specifying dtypes for read_sql in pandas

I would like to specify the dtypes returned when doing pandas.read_sql. In particular I am interested in saving memory and having float values returned as np.float32 instead of np.float64. I know that I can convert afterwards with astype(np.float32) but that doesn't solve the problem of the large memory requirements in the initial query. In my actual code, I will be pulling 84 million rows, not the 5 shown here. pandas.read_csv allows for specifying dtypes as a dict, but I see no way to do that with read_sql.

I am using MySQLdb and Python 2.7.

As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.

In [70]: df=pd.read_sql('select ARP, ACP from train where seq < 5', connection)

In [71]: df
Out[71]: 
   ARP      ACP
0  1.17915  1.42595
1  1.10578  1.21369
2  1.35629  1.12693
3  1.56740  1.61847
4  1.28060  1.05935


In [72]: df.dtypes
Out[72]: 
ARP    float64
ACP    float64
dtype: object
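
For reference, a chunked variant of the astype workaround (a minimal sketch against the same connection; it caps the float64 overhead at one chunk at a time rather than eliminating it, and the chunk size is an arbitrary choice):

import numpy as np
import pandas as pd

# Downcast each chunk to float32 as it arrives, so only one chunk's
# float64 copy is alive at any moment.
chunks = pd.read_sql('select ARP, ACP from train', connection,
                     chunksize=1000000)
df = pd.concat((chunk.astype(np.float32) for chunk in chunks),
               ignore_index=True)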

Upvotes: 28

Views: 28518

Answers (6)

korzi

Reputation: 11

Regarding @Ohad Bruker's answer: since pandas 2.0.0 it is also possible to pass dtype directly to read_sql.
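
For example (a minimal sketch reusing the question's connection and query; requires pandas >= 2.0):

import numpy as np
import pandas as pd

df = pd.read_sql('select ARP, ACP from train where seq < 5', connection,
                 dtype={'ARP': np.float32, 'ACP': np.float32})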

This could be added as a comment to that answer, but I am not authorized to comment yet.

Upvotes: 1

OscarG

Reputation: 1

CAST your columns to the desired data types in your SQL query; pandas will adhere to the explicit types you declared. You can confirm this by calling df.info().
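
For example (a sketch against the question's table; note that CAST(... AS FLOAT) requires MySQL 8.0.17 or later, so older servers would need DECIMAL or a similar target type):

import pandas as pd

df = pd.read_sql(
    'SELECT CAST(ARP AS FLOAT) AS ARP, CAST(ACP AS FLOAT) AS ACP '
    'FROM train WHERE seq < 5',
    connection)
df.info()  # check which dtypes pandas actually inferred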

Upvotes: 0

Xiaoying Wang

Reputation: 349

As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.

Maybe you can try our tool ConnectorX (pip install -U connectorx), which is implemented in Rust and aims to improve the performance of pandas.read_sql in terms of both time and memory usage, and provides a similar interface. To switch to it, you only need to:

import connectorx as cx

# Connection string: replace the placeholders with your credentials.
conn_url = "mysql://username:password@server:port/database"
query = "select ARP, ACP from train where seq < 5"
df = cx.read_sql(conn_url, query)

The reason pandas.read_sql uses a lot of memory while running is its large intermediate Python objects; in ConnectorX we use Rust and stream processing to tackle this problem.

Here are some benchmark results (memory-usage charts for PostgreSQL and MySQL; images not reproduced here).

Upvotes: 7

Ohad Bruker

Reputation: 524

You can use pandas.read_sql_query, which allows you to specify the returned dtypes (supported since pandas 1.3).

import numpy as np

df = pd.read_sql_query('select ARP, ACP from train where seq < 5', connection,
                       dtype={'ARP': np.float32, 'ACP': np.float32})

Upvotes: 15

mooli

Reputation: 329

Take a look at this GitHub issue; it looks like they are inclined to add the option.

Upvotes: 2

Dylan

Reputation: 428

What about cast() and convert()?

'SELECT CAST(ARP AS FLOAT) AS ARP, CAST(ACP AS FLOAT) AS ACP FROM train WHERE seq < 5'

or something similar.

http://www.smallsql.de/doc/sql-functions/system/convert.html

Upvotes: 3
