jonjon
jonjon

Reputation: 71

Creating summing and dividing in pandas groupby object

I have a list of cities and a dummy variable. A 1 indicates there was a fraud order and 0 indicates the order was good. I'm trying to determine what percent of the time an order was fraudulent.

Therefore, I'm trying to:

  1. Group by city
  2. Create a sum of the number of orders they made,
  3. Then determine what percent of orders were fraud.

My dataframe:

df_city = pd.DataFrame({'City':['LA', 'NYC', 'NYC', 'LA', 'Chicago', 'NYC'],
       'isFraud':[1, 0, 0, 1, 0, 1]})

I can group by the city:

group = df['isFraud'].groupby(df['City']).count()

City     LA          2
         NYC         3
         Chicago     1

But the above doesn't break out whether they met the threshold.

I created columns for each dummy:

df_fraud = pd.get_dummies(df_city['isFraud'])
df_city = df_city.join(df_fraud)

df_city.rename(columns={0 : 'Non-Fraud', 1 : 'Fraud'}, inplace=True)
df_city['Fraud'].where(df_city['Fraud'] == 0, np.nan)
df_city['Non-Fraud'].where(df_city['Non-Fraud'] == 0, 1)

#Group by City
group_fraud_rate = df_city[['Fraud', 'Non-Fraud']].groupby(df_city['ship_city']).count()`

          Fraud   Non-Fraud
LA         2         0
NYC        1         2
Chicago    0         1

However, I'm not sure if/how I can get a sum column and then calculate a percent.

Any help is appreciated!

Upvotes: 2

Views: 1239

Answers (2)

Scott Boston
Scott Boston

Reputation: 153460

The easiest, pythonic way to get a percent in a 0/1, True/False column is to take the mean.

df_out = df_city.groupby('City')['isFraud'].agg([lambda x: (~x.astype(bool)).sum(),'sum','mean'])
df_out.columns = ['Non-Fraud','Fraud','Pct. Fraud']
print(df_out)

Output:

         Non-Fraud  Fraud  Pct. Fraud
City                                 
Chicago          1      0    0.000000
LA               0      2    1.000000
NYC              2      1    0.333333

Upvotes: 1

Allen Qin
Allen Qin

Reputation: 19947

You can use agg function and then calculate fraud, non-fraud and fraud percent separately.

df.groupby(df['City'])['isFraud'].agg({'Fraud':sum, 'Non-Fraud': lambda x: len(x)-sum(x), 'Fraud_Pct': lambda x: float(sum(x))/len(x)})
Out[1084]: 
         Non-Fraud  Fraud  Fraud_Pct
City                                
Chicago          1      0   0.000000
LA               0      2   1.000000
NYC              2      1   0.333333

Another way to do this without using dict renaming:

df.groupby(df['City'])['isFraud'].agg(['sum', lambda x: len(x)-sum(x),'mean']).rename(columns={'sum':'Fraud','<lambda>': 'Non-Fraud','mean': 'Fraud_Pct'})

Upvotes: 3

Related Questions