Reputation: 675
I would like to specify the dtypes returned when doing pandas.read_sql. In particular I am interested in saving memory and having float values returned as np.float32 instead of np.float64. I know that I can convert afterwards with astype(np.float32) but that doesn't solve the problem of the large memory requirements in the initial query. In my actual code, I will be pulling 84 million rows, not the 5 shown here. pandas.read_csv allows for specifying dtypes as a dict, but I see no way to do that with read_sql.
I am using MySQLdb and Python 2.7.
As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.
In [70]: df = pd.read_sql('select ARP, ACP from train where seq < 5', connection)

In [71]: df
Out[71]:
       ARP      ACP
0  1.17915  1.42595
1  1.10578  1.21369
2  1.35629  1.12693
3  1.56740  1.61847
4  1.28060  1.05935

In [72]: df.dtypes
Out[72]:
ARP    float64
ACP    float64
dtype: object
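For reference, the after-the-fact conversion I mentioned looks roughly like this (a sketch; memory_usage only shows the final footprint, not the peak usage while the query runs):

import numpy as np
import pandas as pd

df = pd.read_sql('select ARP, ACP from train where seq < 5', connection)
print(df.memory_usage(deep=True).sum())    # float64 storage
df32 = df.astype(np.float32)
print(df32.memory_usage(deep=True).sum())  # roughly half for the float columns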
Upvotes: 28
Views: 28518
Reputation: 11
Regarding @Ohad Bruker's answer: since pandas 2.0.0 it is also possible to pass dtype directly to read_sql.
This could have been a comment on that answer, but I am not authorized to comment yet.
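A minimal sketch, assuming pandas >= 2.0.0 and reusing the connection from the question:

import numpy as np
import pandas as pd

# the dtype argument on read_sql itself requires pandas >= 2.0.0
df = pd.read_sql('select ARP, ACP from train where seq < 5', connection,
                 dtype={'ARP': np.float32, 'ACP': np.float32})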
Upvotes: 1
Reputation: 1
CAST your columns to the desired data types in your SQL query; Python will adhere to the explicit data types from the SQL. You can confirm this with df.info()
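A hedged sketch of that approach (it assumes a server that accepts FLOAT as a CAST target, e.g. MySQL 8.0.17+; whether the narrower type survives into the DataFrame still depends on the driver):

# assumes CAST(... AS FLOAT) is valid on the server; df.info() shows what
# pandas actually received from the driver
query = ('SELECT CAST(ARP AS FLOAT) AS ARP, CAST(ACP AS FLOAT) AS ACP '
         'FROM train WHERE seq < 5')
df = pd.read_sql(query, connection)
df.info()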
Upvotes: 0
Reputation: 349
As an aside, read_sql seems to use far more memory while running (about 2x) than it needs for the final DataFrame storage.
Maybe you can try our tool ConnectorX (pip install -U connectorx), which is implemented in Rust and aims to improve the performance of pandas.read_sql in terms of both time and memory usage while providing a similar interface. To switch to it, you only need to:
import connectorx as cx
conn_url = "mysql://username:password@server:port/database"
query = "select ARP, ACP from train where seq < 5"
df = cx.read_sql(conn_url, query)
The reason pandas.read_sql uses a lot of memory while it runs is its large intermediate Python objects; in ConnectorX we use Rust and stream processing to tackle this problem.
Here are some benchmark results:
Upvotes: 7
Reputation: 524
You can use pandas' read_sql_query, which lets you specify the returned dtypes via its dtype argument (supported since pandas 1.3).
import numpy as np

pd.read_sql_query('select ARP, ACP from train where seq < 5', connection,
                  dtype={'ARP': np.float32, 'ACP': np.float32})
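A quick sanity check afterwards (assuming the result above was assigned to df):

df.dtypes          # should now show float32 for ARP and ACP
df.memory_usage()  # per-column bytes roughly half those of the float64 version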
Upvotes: 15
Reputation: 329
Take a look at this GitHub issue; it looks like they are inclined to add the option.
Upvotes: 2
Reputation: 428
What about cast() and convert()?
'SELECT CAST(ARP AS FLOAT), CAST(ACP AS FLOAT) FROM train WHERE seq < 5'
or something similar.
http://www.smallsql.de/doc/sql-functions/system/convert.html
Upvotes: 3