PyRaider

Reputation: 689

Outliers in Work Schedules

I am trying to find outliers in individuals' work schedules (mostly large variations): cases where someone arrives or leaves well outside their individual norm (8:30am to 5pm) or the group norm (7am to 6pm). I tried using the standard deviation, but that has two problems:

  1. It gives me outliers on both sides of the mean. That is, if someone comes in late during work hours (say 10am) or leaves early (say 4pm).
  2. Another problem is the mean itself. It takes a lot of observations to bring the mean down to the most frequent times if there are a few extremes at the beginning of the data set. For example, one set had a few in-times around 3pm, 11am, 10am and 9am, but most were around 6am, and it took a lot of observations for the mean to reach 6am. I thought of weighted averages, but that would mean rounding times to the nearest 30 minutes or so, and I would like to avoid changing data points.

Is there any known way to find outliers in a work schedule? I tried searching, but all I get is outliers in time series, whereas I am looking for outliers in the time of day itself. Any suggestions?

Note: My data set has a PersonID and multiple (swipe) times per day per PersonID. I am using Python 2.7.

Upvotes: 1

Views: 218

Answers (1)

AChervony

Reputation: 663

If I understand correctly, you are looking to identify people who depart extremely early or arrive extremely late compared to their own and overall norms.

  1. The first issue you encountered seems related to tagging outliers that deviate by (N x standard deviation) either late or early. You should be able to control whether to tag outliers on one side or both.
  2. The second issue has to do with the mean being biased or unstable with a small sample. Unless you know the outlier thresholds ahead of time, you will need a healthy sample to identify outliers in any case. If the mean does not converge quickly enough on the most common value, and that value is what you are after, use the mode instead of the mean.
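
As a minimal sketch of both points, the snippet below compares the mean and the mode on a small made-up sample shaped like the one in the question, then applies a one-sided threshold so that only late arrivals are flagged. The data, the variable names and the 5% tail are assumptions for illustration, and it uses only the Python 3.8+ standard library rather than pandas.

```python
from statistics import NormalDist, mean, multimode, stdev

# Hypothetical arrival times in decimal hours (15.0 = 3pm), echoing the
# question's example: a few late starts first, then mostly 6am.
arrivals = [15.0, 11.0, 10.0, 9.0] + [6.0] * 12

mean_in = mean(arrivals)             # pulled up by the few late arrivals
mode_in = mean(multimode(arrivals))  # most frequent value; ties are averaged

# z-score that cuts off a one-sided 5% tail of a normal distribution.
z = NormalDist().inv_cdf(1 - 0.05)
spread = stdev(arrivals)             # sample standard deviation, in hours

# One-sided rule: only arrivals later than the threshold are flagged.
# Early arrivals are never flagged because there is no lower threshold.
late_threshold = mode_in + z * spread
late = [t for t in arrivals if t > late_threshold]

print("mean %.2f, mode %.2f, late threshold %.2f" % (mean_in, mode_in, late_threshold))
print("late arrivals:", late)
```

With this sample the mode sits at 6.0 while the mean is still about 7.3, and only the 3pm and 11am arrivals exceed the threshold. For departures the same idea applies with the sign flipped (mode minus z times the standard deviation).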

Also, I would suggest looking at the daily hours (the difference between arrival and departure each day) as a separate metric.
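
One possible way to derive that metric from raw swipe data is sketched here. It assumes a hypothetical log of (PersonID, timestamp) pairs, treats the first swipe of a day as arrival and the last as departure, and ignores shifts that cross midnight. The sample records and the helper name `to_decimal_hour` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical swipe log: (PersonID, timestamp string). The layout is an assumption.
swipes = [
    (1, "2017-03-01 08:31"),
    (1, "2017-03-01 12:02"),
    (1, "2017-03-01 17:05"),
    (2, "2017-03-01 10:15"),
    (2, "2017-03-01 15:45"),
    (1, "2017-03-02 08:40"),
    (1, "2017-03-02 16:10"),
]

def to_decimal_hour(ts):
    """Convert a datetime to hours since midnight, e.g. 08:30 -> 8.5."""
    return ts.hour + ts.minute / 60.0 + ts.second / 3600.0

# Group swipe times by (person, calendar day).
by_day = defaultdict(list)
for person_id, stamp in swipes:
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    by_day[(person_id, ts.date())].append(to_decimal_hour(ts))

# First swipe = arrival, last swipe = departure, difference = daily hours.
daily = {}
for key, hours in by_day.items():
    time_in, time_out = min(hours), max(hours)
    daily[key] = (time_in, time_out, time_out - time_in)

for (person_id, day), (time_in, time_out, worked) in sorted(daily.items()):
    print(person_id, day, round(time_in, 2), round(time_out, 2), round(worked, 2))
```

The resulting decimal-hour values have the same form as the TimeIn and TimeOut columns used in the answer's code, and the third value per day can be checked for outliers in the same way as the arrival and departure times.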

Below is a directional approach to tackling your problem, in Python 3 (sorry).
It should address the issues you mentioned, but it does not add the daily hours that I think you should include.

This is the output you can expect:

Outlier PersonIDs based on overall data
array([ 1.,  4.,  7.,  8.])
Outlier PersonIDs based on each user's data and overall deviation
array([ 1.,  3.,  4.,  5.,  7.,  8.,  9.])

This is the daily arrival and departure time distribution: [image: histogram of arrival and departure time frequency by hour of day]

Here's the code:

#! /usr/bin/python3

import random
import pandas as pd
import numpy as np
import scipy.stats
import pprint
pp = pprint.PrettyPrinter(indent=4)

# Visualize:
import matplotlib.pyplot as plt

#### Create Sample Data START
# Parameters:
TimeInExpected=8.5 # 8:30am
TimeOutExpected=17 # 5pm
sig=1 # Standard deviation of 1 hour
Persons=11
# Increasing the ratio between sample size and persons will make more people outliers.
SampleSize=20
Accuracy=1 # Each hour is segmented by hour tenth (6 minutes)

# Generate sample (np.random.randint excludes the upper bound, so PersonIDs run from 1 to Persons-1)
SampleDF=pd.DataFrame([
    np.random.randint(1,Persons,size=(SampleSize)),
    np.around(np.random.normal(TimeInExpected, sig,size=(SampleSize)),Accuracy),
    np.around(np.random.normal(TimeOutExpected, sig,size=(SampleSize)),Accuracy)
    ]).T
SampleDF.columns = ['PersonID', 'TimeIn','TimeOut']

# Visualize
plt.hist(SampleDF['TimeIn'],rwidth=0.5,range=(0,24))
plt.hist(SampleDF['TimeOut'],rwidth=0.5,range=(0,24))
plt.xticks(np.arange(0,24, 1.0))
plt.xlabel('Hour of day')
plt.ylabel('Arrival / Departure Time Frequency')
plt.show()
#### Create Sample Data END


#### Analyze data 
# One-sided tail probability: flags only late arrivals and early departures,
# i.e. events expected about 5% of the time under a normal distribution.
OutlierSensitivity=0.05
# z-score (number of standard deviations) that cuts off that tail
presetPercentile=scipy.stats.norm.ppf(1-OutlierSensitivity)

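# Distribution feature and threshold percentile
# The spread must be the standard deviation (in hours), not the variance (in hours squared),
# because it is multiplied by a z-score and added to a time.
argdictOverall={
    "ExpIn":SampleDF['TimeIn'].mode().mean().round(1)
    ,"ExpOut":SampleDF['TimeOut'].mode().mean().round(1)
    ,"sigIn":SampleDF['TimeIn'].std()
    ,"sigOut":SampleDF['TimeOut'].std()
    ,"percentile":presetPercentile
}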
OutlierIn=argdictOverall['ExpIn']+argdictOverall['percentile']*argdictOverall['sigIn']
OutlierOut=argdictOverall['ExpOut']-argdictOverall['percentile']*argdictOverall['sigOut']

# Overall
# See all users with outliers - overall
Outliers=SampleDF["PersonID"].loc[(SampleDF['TimeIn']>OutlierIn) | (SampleDF['TimeOut']<OutlierOut)]

# See all observations with outliers - Overall
# pp.pprint(SampleDF.loc[(SampleDF['TimeIn']>OutlierIn) | (SampleDF['TimeOut']<OutlierOut)].sort_values(["PersonID"]))

# Sort and remove duplicates
Outliers=np.sort(np.unique(Outliers))
# Show users with overall outliers:
print("Outlier PersonIDs based on overall data")
pp.pprint(Outliers)

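# For each person: flag anyone with an outlier relative to their own distribution.
# Flagged PersonIDs are appended to Outliers, so the final list is the union of both checks.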
for Person in SampleDF['PersonID'].unique():
    # Person specific dataset
    SampleDFCurrent=SampleDF.loc[SampleDF['PersonID']==Person]
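    # Distribution feature and threshold percentile
    # Standard deviation (hours) rather than variance; NaN for a person with a single observation,
    # in which case the comparisons below are False and the person is not flagged.
    argdictCurrent={
        "ExpIn":SampleDFCurrent['TimeIn'].mode().mean().round(1)
        ,"ExpOut":SampleDFCurrent['TimeOut'].mode().mean().round(1)
        ,"sigIn":SampleDFCurrent['TimeIn'].std()
        ,"sigOut":SampleDFCurrent['TimeOut'].std()
        ,"percentile":presetPercentile
    }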
    OutlierIn=argdictCurrent['ExpIn']+argdictCurrent['percentile']*argdictCurrent['sigIn']
    OutlierOut=argdictCurrent['ExpOut']-argdictCurrent['percentile']*argdictCurrent['sigOut']
    if SampleDFCurrent['TimeIn'].max()>OutlierIn or SampleDFCurrent['TimeOut'].min()<OutlierOut:
        Outliers=np.append(Outliers,Person)

# Sort and get unique values
Outliers=np.sort(np.unique(Outliers))
# Show users with overall or per-person outliers:
print("Outlier PersonIDs based on each user's data and overall deviation")
pp.pprint(Outliers)

Upvotes: 1
