Langutang
Langutang

Reputation: 43

How to compare data frame columns and compare by boolean field in python

I am attempting to compare two columns within the same data frame and let me know if the value is outside 2 standard deviations of a rolling 7 day window.

I have the following code to make a copy of the current column and take its standard deviation

ma_columns = ['Carrier_AcctPswd', 'Carrier_Activation',
       'Carrier_AddRemove_Feature', 'Carrier_Billing', 'Carrier_Call_Quality',
       'Carrier_Client_Specific_App', 'Carrier_DataUsage', 'Carrier_Dev_Rpl',
       'Carrier_Wty_Equip', 'Content_Sync_Various', 'Dev_Hardware_Other',
       'Dev_Hardware_Screen', 'Dev_Issue_Various', 'Dev_Setting_Acct',
       'Dev_Setting_Battery', 'Dev_Setting_Other', 'Education', 'Email',
       'General_QA', 'Make_Receive_Calls', 'Messaging', 'New_Device_Setup',
       'Ntwk_Data_Connection', 'Other', 'Unknown', 'Voice', 'Voicemail',
       'WiFi']

for column in ma_columns:
    st_column = column + '_stdev'
    original_data[st_column] = pd.DataFrame(original_data[column].rolling(window=7).std())
    
    print(original_data)

I now want to match the respective value column to two standard deviations to its rolling column and see if the current value is outside of two standard deviations. EX: "Wifi >= 2x Wifi_stdev | Wifi <= 2x WiFi_stdev". I want to iterate over every column for each variable (about 28) and I am using np.where. Using this code below:

for column in original_data:
    original_data['alert'] = np.where((original_data[original_data.columns[column]] >= 2*original_data[original_data.columns[column+'_stdev']]), 'T', 'F')

I get this error:

<ipython-input-14-82df91b172c5> in <module>
      1 for column in original_data:
----> 2     original_data['alert'] = np.where(((original_data[original_data.columns[column]] >= original_data[original_data.columns[column+'_ma']]) or (original_data[original_data.columns[column]] <= original_data[original_data.columns[column+'_ma']])) , 'T', 'F')
      3 

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __getitem__(self, key)
   3928         if is_scalar(key):
   3929             key = com.cast_scalar_indexer(key)
-> 3930             return getitem(key)
   3931 
   3932         if isinstance(key, slice):

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I want this to go into a new column called 'alert' as 'T' if it is outside 2 std or 'F' otherwise.

Upvotes: 0

Views: 188

Answers (1)

Rob Raymond
Rob Raymond

Reputation: 31166

As I understand it you want to build up alert by checking each of the columns is within 2 std deviations. Your logic I believe needs to be modified as stddev in absence of mean is not statistically anything. (Normal distribution)

  1. created a data set with a level of randomisation to re-produce your case
  2. loop through columns to progressively set alert as you migrate across the columns
  3. have captured number of rows that have alert==True as the loop progresses
  4. first seven rows will always be false as rolling window is 7 days
import numpy as np
ma_columns = ['Carrier_AcctPswd', 'Carrier_Activation',
       'Carrier_AddRemove_Feature', 'Carrier_Billing', 'Carrier_Call_Quality',
       'Carrier_Client_Specific_App', 'Carrier_DataUsage', 'Carrier_Dev_Rpl',
       'Carrier_Wty_Equip', 'Content_Sync_Various', 'Dev_Hardware_Other',
       'Dev_Hardware_Screen', 'Dev_Issue_Various', 'Dev_Setting_Acct',
       'Dev_Setting_Battery', 'Dev_Setting_Other', 'Education', 'Email',
       'General_QA', 'Make_Receive_Calls', 'Messaging', 'New_Device_Setup',
       'Ntwk_Data_Connection', 'Other', 'Unknown', 'Voice', 'Voicemail',
       'WiFi']
ma_columns = ma_columns
df = pd.DataFrame([{c:random.randint(10, 11)*random.uniform(0.98,1.02) for c in ma_columns} for i in range(40)])
# just to keep track of what went on...
dflog = pd.DataFrame({"col":[], "alerts":[]}).astype({"alerts":"int64"})
# initialise....
df = df.assign(alert=False)
for c in ma_columns:
    df = df.assign(
        # calc upper and lower bound for current column
        upper=lambda dfa: dfa[c].rolling(window=7).std()*2+dfa[c].rolling(window=7).mean(),
        lower=lambda dfa: dfa[c].rolling(window=7).mean()-dfa[c].rolling(window=7).std()*2,
        # keep previous alert, else check current column...
        alert=lambda dfa: np.where(dfa["alert"], dfa["alert"], (dfa[c]>=dfa["upper"]) | (dfa[c]<=dfa["lower"])),
    ).drop(columns=["upper","lower"])
    # not needed but shows what is going on
    dflog = dflog.append({"col":c, "alerts":df.loc[:,["alert"]].sum().values[0]}, ignore_index=True)


output - what happened

col     Carrier_AcctPswd  Carrier_Activation  Carrier_AddRemove_Feature  Carrier_Billing  Carrier_Call_Quality  Carrier_Client_Specific_App  Carrier_DataUsage  Carrier_Dev_Rpl  Carrier_Wty_Equip  Content_Sync_Various  Dev_Hardware_Other  Dev_Hardware_Screen  Dev_Issue_Various  Dev_Setting_Acct  Dev_Setting_Battery  Dev_Setting_Other  Education  Email  General_QA  Make_Receive_Calls  Messaging  New_Device_Setup  Ntwk_Data_Connection  Other  Unknown  Voice  Voicemail  WiFi
alerts                 1                   1                          1                2                     2                            2                  2                3                  3                     3                   3                    6                  7                 7                    7                  7          7      7           7                   8          8                 9                    10     10       10     10         11    12

output sample

 alert  Carrier_AcctPswd  Carrier_Activation  Carrier_AddRemove_Feature  Carrier_Billing  Carrier_Call_Quality  Carrier_Client_Specific_App  Carrier_DataUsage  Carrier_Dev_Rpl  Carrier_Wty_Equip  Content_Sync_Various  Dev_Hardware_Other  Dev_Hardware_Screen  Dev_Issue_Various  Dev_Setting_Acct  Dev_Setting_Battery  Dev_Setting_Other  Education      Email  General_QA  Make_Receive_Calls  Messaging  New_Device_Setup  Ntwk_Data_Connection      Other    Unknown      Voice  Voicemail       WiFi
 False         10.027893           11.180526                  11.036345        11.053842             10.784546                    10.147791          11.025660        10.058375          10.990958             10.090822           10.086956             9.896482          10.148533         10.789600             9.982659          11.109153  10.969188  10.950963    9.945559           10.072013  10.810217         10.144920             10.954820   9.963897  10.923819  11.112625  11.138334  10.850437
  True         10.078542           10.176072                   9.888113        10.187253              9.874345                    10.192872          10.015289        10.147271          10.050714             10.880132           10.963420            10.147965           9.951384         11.183593            10.974927           9.907986  10.901859   9.975108   10.108487           10.158522   9.892944         10.957816              9.985727  10.160018   9.932426  10.129687  10.930133  10.144481
 False         10.115461           11.108922                  10.979528        10.867064              9.966132                    11.093651          10.015981         9.840299          10.044067             10.936187           10.880957            10.153535          11.031433          9.834146            10.189683          11.024977  10.117397  10.786052   10.933146           11.203619   9.838307         10.112361             10.159005  11.187097  10.099687   9.974852  10.162924  11.142788
 False         10.837826            9.882999                  11.028710        10.082859              9.926954                    10.151154           9.989557        10.032212          10.103652             10.073667           10.048499             9.941714           9.824670         10.033737            10.043848           9.997618   9.915864  10.057772   10.084293           10.107702  11.072084         11.205637             11.011373   9.837888  11.028028  10.192616  10.123548   9.860274
  True         10.183415           10.044214                  10.180521        11.017100             10.162875                    10.926751           9.905115        11.150269          10.853725             10.888850            9.971950             9.952609          10.175848         11.049947            10.113041          10.926425  10.158558  11.055091   10.917608           10.921025   9.917453          9.937600             11.089962  11.006900  11.054964  11.099928  11.015369  10.094890
  True         11.110304           10.067525                  10.804917        10.122729             10.786908                    10.087151          11.130379        10.165037          10.067856              9.813646            9.992265            11.164745           9.987881         10.848681             9.934480           9.875899  11.128687  10.076104   11.106895           11.084006  10.960849         10.834289             11.195834  10.082106  10.043681  10.837019  10.036509  10.952554
 False          9.876790            9.868787                  10.822574        11.155736             10.176943                    11.173443          10.929224         9.942439          10.023803             10.060724           11.147590            10.981729           9.988948         10.912586            11.009330          11.144846   9.943222  10.821988   10.114765            9.817857  10.073799         10.984551             11.059509  11.013951  10.025922  10.789054  10.828943  11.091649

Upvotes: 1

Related Questions