vgoklani

Reputation: 11746

How to properly iterate over a for loop using Dask?

When I run a loop like the one below using Dask and pandas, only the last field in the list gets evaluated. Presumably this is because of lazy evaluation.

import pandas as pd
import dask.dataframe as ddf

df_dask = ddf.from_pandas(df, npartitions=16)

for field in fields:
    df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list)

If I add .compute() to the last line:

df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list).compute()

it then works correctly, but is this the most efficient way of doing this operation? Is there a way for Dask to add all the items from the fields list at once, and then run them in one-shot via compute()?

edit ---------------

Please see screenshot below for a worked example


Upvotes: 1

Views: 3409

Answers (2)

Todd Morrill

Reputation: 11

Here's one way to do it, where string_check is just a sample function that returns True/False. The issue was the late binding of lambda functions.

from functools import partial

def string_check(string, search):
    return search in string

search_terms = ['foo', 'bar']
for s in search_terms:
    # partial binds the current value of s immediately,
    # avoiding the late-binding problem with lambdas
    string_check_partial = partial(string_check, search=s)
    df[s] = df['YOUR_STRING_COL'].apply(string_check_partial)

Upvotes: 0

MRocklin

Reputation: 57271

You will want to call .compute() at the end of your computation to trigger work. Warning: .compute() assumes that your result will fit in memory.

Also, watch out: lambdas late-bind in Python, so the field value may end up being the same for all of your columns.
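To see the late-binding pitfall in isolation, here is a minimal plain-Python sketch (no Dask required; the fields values are made up for illustration):

```python
fields = ["a", "b", "c"]

# Late binding: each lambda looks up `field` when it is *called*,
# by which time the loop has finished and field == "c".
late = [lambda x: x == field for field in fields]
print([f("a") for f in late])   # [False, False, False] -- all compare against "c"

# Fix: a default argument is evaluated when the lambda is *created*,
# so each lambda keeps its own copy of field (functools.partial works too).
bound = [lambda x, field=field: x == field for field in fields]
print([f("a") for f in bound])  # [True, False, False]
```

Applied to the question's loop, writing `lambda _, field=field: ...` pins each new column to its own field value, and a single .compute() on the final dataframe then evaluates all the columns in one shot.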

Upvotes: 2
