Fakir
Fakir

Reputation: 141

A more efficient way to find percentage

From a dataframe after using groupby want to identify the success/unsuccessful rate. With my limited knowledge in Python and pandas though I made it, but looks inefficient and clumsy. There must be some simpler and efficient way. Any help would be much appreciated.

BldgID  BldHt   Device  Date      Time     Result
1213    35      758     20181120  105550   OK
1213    35      758     20181120  105540   NG
1112    40      780     20181120  003102   OK
1117    26      790     20181120  002557   OK
1111    65      780     20181120  002102   NG
1214    80      758     20181120  001600   OK
2902    34      780     20181119  005410   OK
1232    90      780     20181119  001410   OK
1222    75      760     20181119  010552   OK
1214    80      758     20181119  010052   OK
1214    80      758     20181119  005553   NG
1246    16      790     20181119  004556   OK
1128    15      758     20181119  004552   OK
1128    15      758     20181119  004552   OK
1211    30      790     20181119  003557   NG
1211    30      790     20181119  003558   OK

Here is the dataframe (csv). At first I want to group the [BldgID, BldHt, Device] and find their percentage of Result, i.e. OK/OK+NG. What I did was, change 'OK' to '1' and 'NG' to '0'. Sum all the 'OK'. To find the total number (OK+NG), I used the dataframe before changing their value (1 & 0) and calculated the total num. Otherwise, it did not count the NGThen dividing them to find the percentage of 'OK'.

df = pd.read_csv("data.csv")
df1 = df.groupby(['BldgID','Device'])['Result'].agg('sum').reset_index()
df1 = df.replace({'OK':1, 'NG': 0})
df1 = df1.groupby(['BldgID','Device'])['Result'].agg('sum').reset_index()
df1['NumOKs'] = df1['Result']

# used the original df i.e. change OK=1, NG=0, to count the total num of
# OK+NG
df2 = df.groupby(['BldgID','Device']) 
         ['Result'].agg('count').reset_index()

df2['sum'] = mel_df2['Result']
df2.drop(['Result'], axis=1, inplace=True)

df3 = pd.concat([df1['NumOKs'], df2['sum']], axis=1, keys= 
['NumOKs','sum'])

df3.head(10)
# sum represents OK+NG
       NumOKs     sum
 0       2        2
 1       6        6
 2       2        2
 3       2        2
 4       2        2
 5       3        4
 6       3        3
 7       3        3
 8       2        3
 9       3        3

I got what I wanted but it looks very ugly and inefficient. But with my limited knowledge I have to find something quick and it works. Want to achieve this in a better and simpler way. note: The result may not be exactly reproduced as I cut-paste a small part of code and result to explain what I want and how (badly) I did it.

Upvotes: 0

Views: 288

Answers (1)

Scott Boston
Scott Boston

Reputation: 153470

IIUC, try using mean and boolean condition:

df.groupby(['BldgID', 'Device'])['Result'].apply(lambda x : (x=='OK').mean())

To add this to your dataframe:

df['mean_ok'] = df.groupby(['BldgID', 'Device'])['Result']\
                  .apply(lambda x : (x=='OK').mean())

Upvotes: 2

Related Questions