Reputation: 71
I need to process a large file and change some of its values.
I would like to do something like this:
    for index, row in dataFrame.iterrows():
        foo = doSomeStuffWith(row)
        lol = doOtherStuffWith(row)
        dataFrame['colx'][index] = foo
        dataFrame['coly'][index] = lol
Unfortunately, I cannot do dataFrame['colx'][index] = foo!
I have a large number of rows and need to process a large number of columns, so I'm afraid that dask may read the file several times if I do one dataFrame.apply(...) for each column.
Other solutions would be to manually break my data into chunks and use pandas, or to just load everything into a database. But it would be nice if I could keep using my .csv and let dask do the chunk processing for me!
Thanks for your help.
Upvotes: 7
Views: 12074
Reputation: 2467
You can just use the same syntax as pandas, although it does evaluate the dask-dataframe as you go along.
    # iterrows yields (index, row) pairs, as in pandas
    for index, row in dask_df.iterrows():
        print(index, row)
Upvotes: 0
Reputation: 57251
In general, iterating over a dataframe, either Pandas or Dask, is likely to be quite slow. Additionally, Dask won't support row-wise element insertion. This kind of workload is difficult to scale.
Instead, I recommend using dd.Series.where (see this answer), or else doing your iteration in a function (after making a copy so as not to operate in place) and then using map_partitions to call that function across all of the Pandas dataframes in your Dask dataframe.
Upvotes: 4