Reputation: 11746
When I run a loop like this (see below) using dask and pandas, only the last field in the list gets evaluated. Presumably this is because of "lazy-evaluation"
import pandas as pd
import dask.dataframe as ddf
df_dask = ddf.from_pandas(df, npartitions=16)
for field in fields:
df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list)
If I add .compute()
to the last line:
df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list).compute()
it then works correctly, but is this the most efficient way of doing this operation? Is there a way for Dask to add all the items from the fields list at once, and then run them in one-shot via compute()
?
edit ---------------
Please see screenshot below for a worked example
Upvotes: 1
Views: 3409
Reputation: 11
Here's one way to do it, where string check is just a sample function that returns True/False. The issue was the late binding of lambda functions.
from functools import partial
def string_check(string, search):
return search in string
search_terms = ['foo', 'bar']
for s in search_terms:
string_check_partial = partial(string_check, search=s)
df[s] = df['YOUR_STRING_COL'].apply(string_check_partial)
Upvotes: 0
Reputation: 57271
You will want to call .compute()
at the end of your computation to trigger work. Warning: .compute assumes that your result will fit in memory
Also, watch out, lambdas late-bind in Python, so the field
value may end up being the same for all of your columns.
Upvotes: 2