anakaine

Reputation: 1248

Efficient use of If Then scenarios in Dask when creating a new column

I have a csv with about 11m rows that I'm reading into a dask dataframe. I am attempting to create a new column that is the result of an if/then/else scenario. I'm having some trouble understanding how to get it to work, and just as importantly how to get it to work efficiently. I'm new to pandas/dask.

Basically this is what I've tried: calling a function from the column-creation step. This is a simplified example of what I've been trying.

# var1 = 0
# var2 = 10

def find_situation(var1, var2):
    if var1 == 0 and var2 > 10:
        return "Situation 1"
    elif var1 == 0 and var2 < 10:
        return "Situation 2"
    else:
        return "No Situation"

# this is the call that raises the ValueError described below
ddf['situation'] = ddf.apply(find_situation(ddf['ddfvar1'], ddf['ddfvar2']))

This approach results in an error message stating "ValueError: The truth value of a Series is ambiguous. Use a.any() or a.all()." The help topics on a.any() and a.all() read as though any or all of the values in the row being parsed will be considered, rather than the specific values I'm passing to the function. Is that the case?
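To illustrate what I think is happening (a minimal sketch with made-up values, assuming var1 and var2 end up as whole Series rather than single values):

import pandas as pd

var1 = pd.Series([0, 0, 3])     # stand-in for ddf['ddfvar1']
var2 = pd.Series([12, 5, 7])    # stand-in for ddf['ddfvar2']

mask = var1 == 0                # a whole Series of booleans, not one True/False
print(mask.tolist())            # [True, True, False]

# Python's `if`/`and` need a single boolean, so this raises the same ValueError:
#     if var1 == 0 and var2 > 10: ...
# a.any()/a.all() are the two ways of collapsing the Series to one boolean:
print(mask.any(), mask.all())   # True False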

Also, I read that vectorising is far faster, but I'm not sure whether this is the sort of scenario that can be vectorised.

Here is the long version, where as a starting point I'm simply trying to determine the value in the Month column. Really I need to work toward the kind of compound if statements I made in the simplified example:

import dask.dataframe as dd   # dask dataframes implement the pandas API
import dask.multiprocessing
import dask.threaded
import pandas as pd

def f(x):
    if x == 9:
        y = 'Nine'
    elif x == 2:
        y = 'Two'
    else:
        y = 1
    return y


ddf['AndHSR'] = ddf.apply(f(ddf['Month']))

Upvotes: 1

Views: 461

Answers (1)

Henry Yik

Reputation: 22493

You can use np.select for a vectorised approach.

import pandas as pd
import numpy as np

np.random.seed(500)

df = pd.DataFrame({"var1":np.random.randint(0,20,10000000),
                   "var2":np.random.randint(0,25,10000000)})

df["result"] = np.select([(df["var1"]==0)&(df["var2"]>10),
                          (df["var1"]==0)&(df["var2"]<10)],  #list of conditions
                         ["Situation 1", "Situation 2"],     #list of results
                         default="No situation")             #default if no match

print(df.groupby("result").count())

                 var1     var2
result
No situation  9520776  9520776
Situation 1    279471   279471
Situation 2    199753   199753
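Since the question is about dask rather than plain pandas, the same conditions can presumably be pushed down to each partition with map_partitions; a minimal sketch continuing from the pandas example above, assuming a dask dataframe with the same var1/var2 columns (the add_result helper and partition count are my own illustration, not tested on the asker's 11m-row file):

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=8)   # or dd.read_csv(...) for the real file

def add_result(part):
    # each partition is an ordinary pandas DataFrame, so np.select works as above
    part = part.copy()
    part["result"] = np.select(
        [(part["var1"] == 0) & (part["var2"] > 10),
         (part["var1"] == 0) & (part["var2"] < 10)],
        ["Situation 1", "Situation 2"],
        default="No situation")
    return part

ddf = ddf.map_partitions(add_result)
print(ddf["result"].value_counts().compute())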

Upvotes: 1
