Reputation: 77
I'm trying to use the Dask package in Python 3.4 to avoid RAM problems with large datasets, but I've noticed a problem.
Using the native function read_csv, I can load a big dataset into a dask dataframe using less than 150 MB of RAM.
The same dataset, read through a pandas DB connection (using the limit and offset options) and the dask function from_pandas, fills my RAM up to 500-750 MB.
I can't understand why this happens, and I want to fix this issue.
Here is the code:
import pandas as pd
import dask.dataframe as dd

# conn is an open database connection created elsewhere

def read_sql(schema, tab, cond):
    # count the records to work out how many 10000-row pages to fetch
    sql_count = 'Select count(*) from ' + schema + '.' + tab
    if len(cond) > 0:
        sql_count += ' where ' + cond
    a = pd.read_sql_query(sql_count, conn)
    num_record = a['count'][0]
    volte = num_record // 10000  # number of 10000-row pages
    print(num_record)
    if num_record % 10000 > 0:
        volte = volte + 1
    # read the first page and turn it into a dask dataframe
    sql_base = 'Select * from ' + schema + '.' + tab
    if len(cond) > 0:
        sql_base += ' where ' + cond
    sql_base += ' limit 10000'
    base = pd.read_sql_query(sql_base, conn)
    dataDask = dd.from_pandas(base, npartitions=None, chunksize=1000000)
    # read the remaining pages with limit/offset and append each one
    for i in range(1, volte):
        if i % 100 == 0:
            print(i)
        sql_query = 'Select * from ' + schema + '.' + tab
        if len(cond) > 0:
            sql_query += ' where ' + cond
        sql_query += ' limit 10000 offset ' + str(i * 10000)
        a = pd.read_sql_query(sql_query, conn)
        b = dd.from_pandas(a, npartitions=None, chunksize=1000000)
        # reset the divisions so the two frames can be appended
        divisions = list(b.divisions)
        b.divisions = (None,) * len(divisions)
        dataDask = dataDask.append(b)
    return dataDask

a = read_sql('schema', 'tabella', "data>'2016-06-20'")
Thanks for your help.
Looking forward to any news.
Upvotes: 2
Views: 1453
Reputation: 57301
One dask.dataframe is composed of many pandas dataframes or, as in the case of functions like read_csv, of a plan to compute those dataframes on demand. It achieves low-memory execution by running that plan lazily, holding only a few of the constituent dataframes in memory at any one time.
When using from_pandas, the dataframes are already in memory, so there is little that dask.dataframe can do to avoid the memory blowup.
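To make the difference concrete, here is a minimal sketch contrasting the two entry points (the file name big.csv is hypothetical):

import pandas as pd
import dask.dataframe as dd

# Lazy: read_csv only records a plan; chunks are loaded at compute time,
# so only a few blocks are ever in memory at once.
lazy_df = dd.read_csv('big.csv')

# Eager: the full pandas dataframe must already exist in RAM before
# from_pandas is even called, so dask cannot reduce peak memory use.
big = pd.read_csv('big.csv')
eager_df = dd.from_pandas(big, chunksize=1000000)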
In this case I see two solutions:

1. Wait for the dask.dataframe.read_sql function to lazily pull chunks of data from a database. At the time of writing this is in progress here: https://github.com/dask/dask/pull/1181
2. Use dask.delayed to achieve the same result in user code; see http://dask.pydata.org/en/latest/delayed.html and http://dask.pydata.org/en/latest/delayed-collections.html (this is my main suggestion in your case). A sketch follows after this list.
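Here is a minimal sketch of the dask.delayed approach. The names load_chunk and read_sql_lazy are made up for this example; it assumes a conn that can be shared across tasks on the default single-machine scheduler, a Postgres-style 'count' column name as in your code, and a reasonably recent dask where from_delayed accepts a meta keyword:

import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def load_chunk(sql, conn, offset, limit=10000):
    # Runs only when the resulting dask dataframe is computed, so at
    # most a few chunks are materialized in memory at any one time.
    return pd.read_sql_query(sql + ' limit ' + str(limit) +
                             ' offset ' + str(offset), conn)

def read_sql_lazy(schema, tab, cond, conn, chunk=10000):
    sql = 'Select * from ' + schema + '.' + tab
    if cond:
        sql += ' where ' + cond
    # count the rows up front to know how many chunks are needed;
    # the 'count' column name follows the question's code
    n = pd.read_sql_query('Select count(*) from (' + sql + ') t',
                          conn)['count'][0]
    volte = -(-n // chunk)  # ceiling division, as in the question
    parts = [load_chunk(sql, conn, i * chunk, chunk) for i in range(volte)]
    # an empty result gives dask the column metadata without loading data
    meta = pd.read_sql_query(sql + ' limit 0', conn)
    return dd.from_delayed(parts, meta=meta)

a = read_sql_lazy('schema', 'tabella', "data>'2016-06-20'", conn)

Nothing is read until you compute on the result; the chunks then stream through memory one at a time instead of all being materialized up front, which is what from_pandas forces in the original code.
Upvotes: 4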