anakaine

Reputation: 1248

Efficient use of If Then scenarios in Dask when creating a new column

I have a csv with about 11m rows that I'm reading into a dask dataframe. I am attempting to create a new column that is the result of an if/then/else scenario. I'm having some trouble understanding how to get it to work, and just as importantly how to get it to work efficiently. I'm new to pandas/dask.

Basically this is what I've tried: calling a function from the column-creation step. This is a simplified example of what I've been trying.

# var1 = 0
# var2 = 10

def find_situation(var1, var2):
    if var1 == 0 and var2 > 10:
        return "Situation 1"
    elif var1 == 0 and var2 < 10:
        return "Situation 2"
    else:
        return "No Situation"

# this is the call that raises the ValueError described below
ddf['situation'] = ddf.apply(find_situation(ddf['ddfvar1'], ddf['ddfvar2']))

This approach results in an error message stating "ValueError: The truth value of a Series is ambiguous. Use a.any() or a.all()." The help topics on a.any() and a.all() read as though any or all of the values in the row being parsed will be considered, rather than the specific values I'm passing to the function. Is that the case?
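To illustrate what I think is happening (a minimal sketch with made-up values, assuming var1 and var2 end up as whole Series rather than single values):

import pandas as pd

var1 = pd.Series([0, 0, 3])     # stand-in for ddf['ddfvar1']
var2 = pd.Series([12, 5, 7])    # stand-in for ddf['ddfvar2']

mask = var1 == 0                # a whole Series of booleans, not one True/False
print(mask.tolist())            # [True, True, False]

# Python's `if`/`and` need a single boolean, so this raises the same ValueError:
#     if var1 == 0 and var2 > 10: ...
# a.any()/a.all() are the two ways of collapsing the Series to one boolean:
print(mask.any(), mask.all())   # True False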

Also, I read that vectorising is far faster, but I'm not sure whether this is the sort of scenario that can be vectorised.

Here is the long version, where as a starting point I'm simply trying to determine the value in the Month column. Really I need to work toward the kind of compound if statements I made in the simplified example:

import dask.dataframe as dd   # dask dataframes implement the pandas API
import dask.multiprocessing
import dask.threaded
import pandas as pd

def f(x):
    if x == 9:
        y = 'Nine'
    elif x == 2:
        y = 'Two'
    else:
        y = 1
    return y


ddf['AndHSR'] = ddf.apply(f(ddf['Month']))

Upvotes: 1

Views: 461

Answers (1)

Henry Yik

Reputation: 22493

You can use np.select for a vectorised approach.

import pandas as pd
import numpy as np

np.random.seed(500)

df = pd.DataFrame({"var1":np.random.randint(0,20,10000000),
                   "var2":np.random.randint(0,25,10000000)})

df["result"] = np.select([(df["var1"]==0)&(df["var2"]>10),
                          (df["var1"]==0)&(df["var2"]<10)],  #list of conditions
                         ["Situation 1", "Situation 2"],     #list of results
                         default="No situation")             #default if no match

print(df.groupby("result").count())

                 var1     var2
result
No situation  9520776  9520776
Situation 1    279471   279471
Situation 2    199753   199753
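Since the question is about dask rather than plain pandas, the same conditions can presumably be pushed down to each partition with map_partitions; a minimal sketch continuing from the pandas example above, assuming a dask dataframe with the same var1/var2 columns (the add_result helper and partition count are my own illustration, not tested on the asker's 11m-row file):

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=8)   # or dd.read_csv(...) for the real file

def add_result(part):
    # each partition is an ordinary pandas DataFrame, so np.select works as above
    part = part.copy()
    part["result"] = np.select(
        [(part["var1"] == 0) & (part["var2"] > 10),
         (part["var1"] == 0) & (part["var2"] < 10)],
        ["Situation 1", "Situation 2"],
        default="No situation")
    return part

ddf = ddf.map_partitions(add_result)
print(ddf["result"].value_counts().compute())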

Upvotes: 1
