johnnyb
johnnyb

Reputation: 1815

Find outliers of data

I am attempting to find the outliers of seconds with standard deviations. I have two dataframes as below. The outliers I am trying to find are 1.5 standard deviations away from the mean by day of week? Current code is below the dataframes.

df1:

name    dateTime              Seconds
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
joe     2015-01-02 13:13:13   12345.0101

current output: df2

name   day   standardDev        mean           count
Joe    mon   22326.502700       40900.730647   1886
       tue   9687.486726        51166.213836   159
john   mon   10072.707891       41380.035108   883
       tue   5499.475345        26985.938776   196

Expected output:

df2

name   day   standardDev        mean           count     events
Joe    mon   22326.502700       40900.730647   1886      [2015-02-04 12:12:12, 2015-02-04 12:12:13]
       tue   9687.486726        51166.213836   159       [2015-02-04 12:12:12, 2015-02-04 12:12:14]
john   mon   10072.707891       41380.035108   883       [2015-01-02 13:13:13, 2015-01-02 13:13:15]
       tue   5499.475345        26985.938776   196       [2015-01-02 13:13:13, 2015-01-02 13:13:18]

CODE:

allFiles = glob.glob(folderPath + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=['EventTime', "IpAddress", "Hostname", "TargetUserName", "AuthenticationPackageName", "TargetDomainName", "EventReceivedTime"])
    df = df.ix[1:]
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
print(df.groupby((['TargetUserName', 'day_of_week'])).agg({'seconds': {'mean': lambda x: (x.mean()), 'std': lambda x: (np.std(x)), 'count': 'count'}}))

Upvotes: 1

Views: 1903

Answers (1)

JohnE
JohnE

Reputation: 30404

This is a slight adaptation from the pandas docs. I didn't create columns for mean & std, but you could add that pretty easily if you want to see it.

np.random.seed(1111)
df=pd.DataFrame({ 'name':     ['joe','john']*30, 
                  'dateTime': pd.date_range('1-1-2015',periods=60),
                  'Seconds':  np.random.randn(60)+5000. })

grp = df.groupby(['name',df.dateTime.dt.dayofweek])['Seconds']
df['zscore'] = grp.transform( lambda x: (x-x.mean())/x.std())

df[ df['zscore'].abs() > 1.5 ]
Out[79]: 
        Seconds   dateTime  name    zscore
1   4998.927011 2015-01-02  john -1.522488
42  5001.275866 2015-02-12   joe  1.636829
58  4999.124550 2015-02-28   joe -1.624945

df.head(10)
Out[80]:
       Seconds   dateTime  name    zscore
0  4998.699990 2015-01-01   joe -0.959960
1  4998.927011 2015-01-02  john -1.522488
2  5000.790199 2015-01-03   joe  0.263690
3  4999.121735 2015-01-04  john -1.005137
4  5001.501822 2015-01-05   joe  1.132407
5  4999.976071 2015-01-06  john  0.678951
6  5000.275949 2015-01-07   joe  0.650297
7  4999.033607 2015-01-08  john -0.964222
8  4998.419685 2015-01-09   joe -1.328744
9  4999.796325 2015-01-10  john  1.224198

Upvotes: 1

Related Questions