Reputation: 1248
I have a CSV with about 11m rows that I'm reading into a dask dataframe. I am attempting to create a new column whose value is the result of an if/then/else scenario. I'm having some trouble understanding how to get it to work, and just as importantly how to get it to work efficiently. I'm new to pandas/dask.
Basically, this is what I've tried: calling a function from the column-creation step. This is a simplified example of what I've been trying.
#var1 = 0
#var2 = 10
def find_situation(var1, var2):
    if var1 == 0 and var2 > 10:
        print("Situation 1")
    elif var1 == 0 and var2 < 10:
        print("Situation 2")
    else:
        print("No Situation")

ddf['situation'] = ddf.apply(find_situation(ddf['ddfvar1'], ddf['ddfvar2']))
This approach results in an error message stating "ValueError: The truth value of a Series is ambiguous. Use a.any() or a.all()." The help topics on these methods read as though any or all values from the whole column being parsed will be considered, rather than the single row values I thought I was passing to the function?
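The error can be reproduced with plain pandas, independent of dask: comparing a whole Series with `==` yields a boolean Series, and Python's `if` cannot decide whether such a Series counts as true. That is exactly what happens when the function is called with `ddf['ddfvar1']` instead of a single value. A minimal sketch:

```python
import pandas as pd

s = pd.Series([0, 0, 5])
mask = (s == 0)          # a boolean Series, not a single True/False
try:
    if mask:             # same ambiguity the apply call runs into
        pass
except ValueError as e:
    print(e)             # "The truth value of a Series is ambiguous..."
```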
Also, I read that vectorising is far faster, but I'm not sure whether this is the sort of scenario that can be vectorised?
The long version, where I'm simply trying to determine the value in the Month column as a starting point. Eventually I need to work toward the kind of compound if statements I made in the simplified example:
import dask.multiprocessing
import dask.threaded
import pandas as pd
# Dask dataframes implement the pandas API
import dask.dataframe as dd

def f(x):
    if x == 9:
        y = 'Nine'
    elif x == 2:
        y = 'Two'
    else:
        y = 1
    return y

ddf['AndHSR'] = ddf.apply(f(ddf['Month']))
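One problem in the line above is that `apply` expects to be given the function itself, not the result of calling it. A minimal pandas sketch of the element-wise version (dask's API mirrors this, typically with an extra `meta=` hint, e.g. `ddf['Month'].map(f, meta=('Month', 'object'))`; the `ddf` call here is illustrative only):

```python
import pandas as pd

def f(x):
    if x == 9:
        return 'Nine'
    elif x == 2:
        return 'Two'
    return 1

df = pd.DataFrame({'Month': [9, 2, 7]})
# pass the function object; map applies it to each element of the column
df['AndHSR'] = df['Month'].map(f)
print(df['AndHSR'].tolist())   # ['Nine', 'Two', 1]
```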
Upvotes: 1
Views: 461
Reputation: 22493
You can use np.select for a vectorised approach.
import pandas as pd
import numpy as np
np.random.seed(500)
df = pd.DataFrame({"var1":np.random.randint(0,20,10000000),
"var2":np.random.randint(0,25,10000000)})
df["result"] = np.select([(df["var1"]==0)&(df["var2"]>10),
(df["var1"]==0)&(df["var2"]<10)], #list of conditions
["Situation 1", "Situation 2"], #list of results
default="No situation") #default if no match
print (df.groupby("result").count())
#
var1 var2
result
No situation 9520776 9520776
Situation 1 279471 279471
Situation 2 199753 199753
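To use this on the original dask dataframe, the np.select logic can be wrapped in a function and handed to `map_partitions`, which runs it on each underlying pandas partition. A sketch, assuming a dask dataframe `ddf` with columns `var1` and `var2` (the commented dask call is illustrative):

```python
import numpy as np
import pandas as pd

def add_result(part):
    # part is one pandas partition of the dask dataframe
    conds = [(part["var1"] == 0) & (part["var2"] > 10),
             (part["var1"] == 0) & (part["var2"] < 10)]
    part = part.copy()
    part["result"] = np.select(conds, ["Situation 1", "Situation 2"],
                               default="No situation")
    return part

# with dask this would be: ddf = ddf.map_partitions(add_result)
df = pd.DataFrame({"var1": [0, 0, 0, 5], "var2": [15, 5, 10, 3]})
print(add_result(df)["result"].tolist())
# ['Situation 1', 'Situation 2', 'No situation', 'No situation']
```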
Upvotes: 1