Reputation: 381
This is more a question about understanding than about programming. I am quite new to Pandas and SQL. I am using pandas to read data from SQL with a specific chunksize. When I run a SQL query, e.g.

import pandas as pd
df = pd.read_sql_query('select name, birthdate from table1', con, chunksize=1000)
What I do not understand is that when I do not give a chunksize, the data is stored in memory and I can see the memory growing; however, when I give a chunksize, the memory usage is not that high.

What I have is that this df now contains a number of arrays which I can access as

for df_array in df:
    print(df_array.head(5))
What I do not understand here is whether the entire result of the SQL statement is kept in memory, i.e. whether df is an object carrying multiple arrays, or whether these are like pointers pointing towards a temp table created by the SQL query.
I would be very glad to develop some understanding about how this process is actually working.
Upvotes: 35
Views: 64125
Reputation: 704
Let's consider two options and what happens in both cases:

- chunksize is not given: pandas fetches the full result of the query and returns it as a single DataFrame.
- chunksize is given: pandas returns a generator that fetches chunksize rows at a time and yields each batch as a DataFrame, so only one chunk has to sit in memory at a time.
For more details you can look at the pandas/io/sql.py module; it is well documented.
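A minimal sketch of the chunked case, loosely modeled on the query-iterator helper in pandas/io/sql.py (names simplified and assumed here, not the exact pandas code):

import pandas as pd

def query_iterator(result, chunksize, columns):
    # yield one DataFrame per batch; only chunksize rows are held at a time
    while True:
        data = result.fetchmany(chunksize)
        if not data:
            break
        yield pd.DataFrame.from_records(data, columns=columns)

Because this is a generator, nothing is fetched until you iterate over it, which is why the memory usage stays low.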
Upvotes: 53
Reputation: 338
It's basically there to stop your machine from running out of memory when you have a massive query.
Out to CSV
for i, chunk in enumerate(pd.read_sql_query(sql, con, chunksize=10000)):
    # append every chunk to one CSV file; write the header only for the first chunk
    chunk.to_csv(tablename + ".csv", mode='a', sep=',', encoding='utf-8', header=(i == 0))
or Out to Parquet
folder_path = 'path/to/output'
for count, chunk in enumerate(pd.read_sql_query(sql, con, chunksize=10000)):
    # write each chunk to its own Parquet part file
    file_path = folder_path + '/part.%s.parquet' % count
    chunk.to_parquet(file_path, engine='pyarrow')
Upvotes: 9
Reputation: 139222
When you do not provide a chunksize, the full result of the query is put in a dataframe at once.

When you do provide a chunksize, the return value of read_sql_query is an iterator of multiple dataframes. This means that you can iterate through it like:

for df in result:
    print(df)
and in each step df is a dataframe (not an array!) that holds the data of a part of the query. See the docs on this: http://pandas.pydata.org/pandas-docs/stable/io.html#querying
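A concrete sketch of that iteration, assuming a local SQLite database example.db that contains the table1 from the question:

import sqlite3
import pandas as pd

con = sqlite3.connect('example.db')  # hypothetical database file
result = pd.read_sql_query('select name, birthdate from table1', con, chunksize=1000)
print(type(result))  # a generator, not a DataFrame

for df in result:
    print(df.shape)  # each df is a DataFrame holding at most 1000 rows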
To answer your question regarding memory, you have to know that there are two steps in retrieving the data from the database: execute and fetch.

First the query is executed (result = con.execute(query)) and then the data are fetched from this result set as a list of tuples (e.g. data = result.fetchmany(n)). When fetching you can specify how many rows at once you want to fetch, and this is what pandas does when you provide a chunksize.
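A minimal sketch of those two steps with the standard sqlite3 driver (example.db and table1 are assumed here for illustration):

import sqlite3

con = sqlite3.connect('example.db')  # hypothetical database file
result = con.execute('select name, birthdate from table1')  # step 1: execute

while True:
    rows = result.fetchmany(1000)  # step 2: fetch 1000 rows at a time
    if not rows:
        break
    # rows is a list of tuples; roughly what pandas turns into one chunk
    print(rows[:5])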
But many database drivers already put all the data into memory in the execute step, not only when fetching the data. So in that regard, it should not matter much for the memory, apart from the fact that the copying of the data into a DataFrame happens in smaller steps while iterating with a chunksize.
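For drivers that support server-side cursors, you can avoid that buffering in the execute step. A hedged sketch with SQLAlchemy and PostgreSQL (the connection string is a placeholder, and this goes beyond the answer above):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('postgresql://user:password@localhost/mydb')  # hypothetical DSN
# stream_results asks the driver for a server-side cursor, so the execute
# step does not pull the whole result set into client memory
with engine.connect().execution_options(stream_results=True) as con:
    for chunk in pd.read_sql_query('select name, birthdate from table1', con, chunksize=1000):
        print(chunk.head(5))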
Upvotes: 31