Reputation: 1687
I have a table generated on a server and I connect to it using a Presto client as follows:
from pyhive import presto  # assuming the PyHive Presto client
import pandas as pd
conn = presto.connect('hostname', port)
db = "some_large_table"
What I would like to do is read one chunk at a time, do my processing, and append that chunk to an existing DataFrame, i.e.:
sql = "select column1, .. column20 limit 100"
chunk_count = 0
dfs = []
for chunk in pd.read_sql_query(sql, conn, chunksize=10) :
chunk_count +=1
dfs.append(chunk)
print(dfs[['column1', 'column2']])
The print of dfs only shows 10 rows, which means it is not appending.
Upvotes: 1
Views: 1835
Reputation: 79
Appending to DataFrames is really slow and should be avoided. pandas does not append in place; the DataFrame is always copied into a new object.
So in your code, you could do this:
dfs = pd.DataFrame()  # empty DataFrame
# then, inside the loop:
dfs = dfs.append(chunk)
And that would work. If you don't have very many chunks, it is not bad, but as dfs grows it will slow to a crawl.
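Note also that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the usual pattern is to collect the chunks in a plain list and concatenate once at the end. A minimal sketch, assuming conn is the Presto connection from the question:
import pandas as pd

sql = "select column1, .. column20 limit 100"
chunks = []
for chunk in pd.read_sql_query(sql, conn, chunksize=10):
    chunks.append(chunk)  # appending to a plain list is cheap
dfs = pd.concat(chunks, ignore_index=True)  # a single copy at the end
Since pd.concat allocates the result once, the total work stays linear in the number of rows.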
I find it is best to append to a lod (list of dicts) and then cast it all at once into a DataFrame, if you want one big DataFrame.
Thus:
sql = "select column1, .. column20 limit 100"
chunk_count = 0
lod = []
for chunk_df in pd.read_sql_query(sql, conn, chunksize=10) :
chunk_count += 1
lod += chunk_df.to_dict(orient='records')
dfs = pd.DataFrame.from_dict(lod)
print(f"processed {chunk_count} chunks.\n", dfs[['column1', 'column2']])
This method does not get slower as you go, because appending to a list is cheap; the lod is then converted to dfs in one shot.
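For reference, here is what to_dict(orient='records') yields for one chunk, a plain list of row dicts (toy data, not from the question):
import pandas as pd

chunk_df = pd.DataFrame({'column1': [1, 2], 'column2': ['a', 'b']})
print(chunk_df.to_dict(orient='records'))
# [{'column1': 1, 'column2': 'a'}, {'column1': 2, 'column2': 'b'}]
The pd.DataFrame constructor accepts such a list of dicts directly, which is why the final conversion works in one shot.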
Upvotes: 1
Reputation: 1687
In my query I limited the number of rows to 10. For some reason df_full.append() did not work; I changed it to df_full = df_full.append() and it works fine (append returns a new DataFrame rather than modifying df_full in place, so the result has to be assigned back).
sql = "select*...limit 10"
df_source = pd.read_sql_query(sql, conn, chunksize=2)
chunk_count = 0
df_list = []
df_full = pd.DataFrame(columns = col_names)
for chunk in df_source:
df_list.append(chunk)
for df_item in df_list:
df_full = df_full.append(df_item, ignore_index = True)
print(df_full)
Result:
[10 rows x 38 columns]
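Incidentally, the second loop can be collapsed into a single concat over df_list, which avoids re-copying df_full on every iteration (a sketch reusing the names above):
import pandas as pd

df_full = pd.concat(df_list, ignore_index=True)  # one copy instead of one per chunk
print(df_full)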
Upvotes: 1
Reputation: 11080
Well, I can't be sure that this will work without more context, but I can tell you that your issue arises because dfs is a list of DataFrames, not a DataFrame. That said, with this approach you will assign dfs to your first chunk and append subsequent chunks to that result.
sql = "select column1, .. column20 limit 100"
chunk_count = 0
for chunk in pd.read_sql_query(sql, conn, chunksize=10) :
chunk_count +=1
try:
dfs.append(chunk)
except:
dfs = chunk
print(dfs[['column1', 'column2']])
Upvotes: -1