Reputation: 71
I have a list of cities and a dummy variable. A 1 indicates there was a fraud order and 0 indicates the order was good. I'm trying to determine what percent of the time an order was fraudulent.
Therefore, I'm trying to:
My dataframe:
df_city = pd.DataFrame({'City':['LA', 'NYC', 'NYC', 'LA', 'Chicago', 'NYC'],
'isFraud':[1, 0, 0, 1, 0, 1]})
I can group by the city:
group = df['isFraud'].groupby(df['City']).count()
City LA 2
NYC 3
Chicago 1
But the above doesn't break out whether they met the threshold.
I created columns for each dummy:
df_fraud = pd.get_dummies(df_city['isFraud'])
df_city = df_city.join(df_fraud)
df_city.rename(columns={0 : 'Non-Fraud', 1 : 'Fraud'}, inplace=True)
df_city['Fraud'].where(df_city['Fraud'] == 0, np.nan)
df_city['Non-Fraud'].where(df_city['Non-Fraud'] == 0, 1)
#Group by City
group_fraud_rate = df_city[['Fraud', 'Non-Fraud']].groupby(df_city['ship_city']).count()`
Fraud Non-Fraud
LA 2 0
NYC 1 2
Chicago 0 1
However, I'm not sure if/how I can get a sum column and then calculate a percent.
Any help is appreciated!
Upvotes: 2
Views: 1239
Reputation: 153460
The easiest, pythonic way to get a percent in a 0/1, True/False column is to take the mean.
df_out = df_city.groupby('City')['isFraud'].agg([lambda x: (~x.astype(bool)).sum(),'sum','mean'])
df_out.columns = ['Non-Fraud','Fraud','Pct. Fraud']
print(df_out)
Output:
Non-Fraud Fraud Pct. Fraud
City
Chicago 1 0 0.000000
LA 0 2 1.000000
NYC 2 1 0.333333
Upvotes: 1
Reputation: 19947
You can use agg function and then calculate fraud, non-fraud and fraud percent separately.
df.groupby(df['City'])['isFraud'].agg({'Fraud':sum, 'Non-Fraud': lambda x: len(x)-sum(x), 'Fraud_Pct': lambda x: float(sum(x))/len(x)})
Out[1084]:
Non-Fraud Fraud Fraud_Pct
City
Chicago 1 0 0.000000
LA 0 2 1.000000
NYC 2 1 0.333333
Another way to do this without using dict renaming:
df.groupby(df['City'])['isFraud'].agg(['sum', lambda x: len(x)-sum(x),'mean']).rename(columns={'sum':'Fraud','<lambda>': 'Non-Fraud','mean': 'Fraud_Pct'})
Upvotes: 3