Aaditya Ura

Reputation: 12679

Identifying statistical outliers with pandas: groupby and reduce rows into a different dataframe

I'm trying to understand how to identify statistical outliers in groups of a dataframe. I need to group the rows by various conditions, reduce each group to a single row, and then find the outliers among the reduced rows.

df = pd.DataFrame({'X0': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1, 10: 0, 11: 1, 12: 0, 13: 1, 14: 1, 15: 1, 16: 0, 17: 0, 18: 0, 19: 1, 20: 0, 21: 1, 22: 1, 23: 1, 24: 1, 25: 0, 26: 1, 27: 1, 28: 1, 29: 1, 30: 0, 31: 1, 32: 0, 33: 1, 34: 0, 35: 1, 36: 1, 37: 0, 38: 1, 39: 0, 40: 1, 41: 0, 42: 1, 43: 0, 44: 0, 45: 1, 46: 1, 47: 1, 48: 1, 49: 0}, 'X1': {0: -0.037112917323895135, 1: -0.10487134240202785, 2: -1.2231079338781112, 3: -1.4422747724730558, 4: 1.1903093112171788, 5: 0.16264503017608584, 6: 0.09575885513801816, 7: -0.04065753545650327, 8: 0.9811627656097434, 9: -1.0895731715012618, 10: 1.2771663221280398, 11: 0.20642136730493899, 12: 1.4502341066082816, 13: 0.393823431298906, 14: 0.3451716634317143, 15: 0.4709902758164765, 16: 0.9982956103799087, 17: 1.189077916373609, 18: 0.9234439359961105, 19: -1.3255857892440723, 20: 1.2098373846214483, 21: -1.0264301443260604, 22: -1.2752711175444444, 23: -1.1775143284917524, 24: 0.259560479915767, 25: 0.8873566136283076, 26: 0.21516987874467863, 27: -1.1480968752611762, 28: -1.1903434754082, 29: 0.16553499639585526, 30: -0.027563846470247143, 31: 0.19474819789386086, 32: 1.5332001783034717, 33: -1.4746621814990961, 34: 0.9251147862187328, 35: 0.009242841373200278, 36: -1.4155649459675044, 37: 1.1476020465617858, 38: -1.3349528515873126, 39: 1.3090049690691499, 40: -1.0159692538569027, 41: 1.006261902461321, 42: -1.161160155994317, 43: 0.8833114074575376, 44: 1.0811966452823563, 45: -1.368200135415236, 46: -1.556580287072397, 47: -1.2006674694322674, 48: 0.13038922844618558, 49: 0.9941283827531714}, 'X2': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 0, 9: 0, 10: 0, 11: 1, 12: 0, 13: 1, 14: 1, 15: 1, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 22: 0, 23: 0, 24: 1, 25: 0, 26: 1, 27: 0, 28: 0, 29: 1, 30: 1, 31: 1, 32: 0, 33: 0, 34: 0, 35: 1, 36: 0, 37: 0, 38: 0, 39: 0, 40: 0, 41: 0, 42: 0, 43: 0, 44: 0, 45: 0, 46: 0, 47: 0, 48: 1, 49: 0}, 'X3': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0, 9: 1, 10: 0, 11: 0, 12: 1, 13: 0, 14: 0, 15: 0, 16: 1, 17: 0, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 0, 24: 0, 25: 1, 26: 0, 27: 1, 28: 1, 29: 0, 30: 0, 31: 0, 32: 1, 33: 1, 34: 1, 35: 0, 36: 1, 37: 1, 38: 1, 39: 0, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1, 48: 0, 49: 1}, 'X4': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1, 10: 0, 11: 1, 12: 0, 13: 1, 14: 1, 15: 1, 16: 0, 17: 0, 18: 0, 19: 1, 20: 0, 21: 1, 22: 1, 23: 1, 24: 1, 25: 0, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 0, 33: 1, 34: 0, 35: 1, 36: 1, 37: 0, 38: 1, 39: 0, 40: 1, 41: 0, 42: 1, 43: 0, 44: 0, 45: 1, 46: 1, 47: 1, 48: 1, 49: 0}, 'X5': {0: -1.6251996907891026, 1: -1.4952824550113089, 2: 0.5929477365851917, 3: 0.5188383985894559, 4: 0.8379329230408614, 5: -1.459754180360659, 6: -1.3954747896019781, 7: -1.4228738797414382, 8: 0.7961049502619677, 9: 0.5969844287269782, 10: 0.6254616540670719, 11: -1.1973174138607352, 12: 0.6743779844553507, 13: -1.3773048616218415, 14: -1.5502881165079259, 15: -1.410649926526345, 16: 0.966418551153225, 17: 0.8413042649713098, 18: 0.5947398261267023, 19: 0.5285211133411081, 20: 0.8154880527487283, 21: 0.685523955516477, 22: 0.7052301139466511, 23: 0.5694387744666269, 24: -1.3660759251156689, 25: 0.7376392137717523, 26: -1.2965881798979835, 27: 0.3247985508699227, 28: 0.8492845744063385, 29: -1.3631982627466268, 30: -1.5593937453283628, 31: -1.5647378670163918, 32: 0.7184017737689418, 33: 0.5401478202493889, 34: 0.8549277265014412, 35: -1.4324174459510242, 36: 0.5699907448414805, 37: 0.5278269967299144, 
38: 0.6544095431196703, 39: 0.9956765313323911, 40: 0.49341021793456574, 41: 0.8777030715347666, 42: 0.5628001790223106, 43: 0.6932468790071539, 44: 0.5944907552098264, 45: 0.6628094310909329, 46: 0.660678722318602, 47: 0.68454503898171, 48: -1.5961965190965848, 49: 0.7606527604851616}, 'X6': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0, 8: 1, 9: 0, 10: 1, 11: 0, 12: 1, 13: 0, 14: 0, 15: 0, 16: 1, 17: 1, 18: 1, 19: 0, 20: 1, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 0, 27: 0, 28: 0, 29: 0, 30: 0, 31: 0, 32: 1, 33: 0, 34: 1, 35: 0, 36: 0, 37: 1, 38: 0, 39: 1, 40: 0, 41: 1, 42: 0, 43: 1, 44: 1, 45: 0, 46: 0, 47: 0, 48: 0, 49: 1}, 'X7': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 1, 22: 0, 23: 1, 24: 0, 25: 0, 26: 0, 27: 0, 28: 1, 29: 0, 30: 0, 31: 0, 32: 0, 33: 1, 34: 0, 35: 0, 36: 1, 37: 0, 38: 1, 39: 0, 40: 1, 41: 0, 42: 0, 43: 0, 44: 0, 45: 1, 46: 0, 47: 1, 48: 0, 49: 0}})

[screenshot: preview of the dataframe]

Using a dataset like this, I would like to select groups by different conditions, such as:

df1 = df[(df['X0'] == 0) & (df['X2'] == 0)]
df2 = df[(df['X0'] == 1) & (df['X3'] == 1)]
df3 = df[(df['X0'] == 1) & (df['X4'] == 1)]
df4 = df[(df['X0'] == 1) & (df['X7'] == 0)]

df5 = df[(df['X2'] == 1) & (df['X6'] == 0)]
df6 = df[(df['X2'] == 1) & (df['X4'] == 1)]
df7 = df[(df['X2'] == 1) & (df['X3'] == 0)]

df8 = df[(df['X3'] == 1) & (df['X7'] == 0)]
df9 = df[(df['X3'] == 1) & (df['X6'] == 1)]

df10 = df[(df['X7'] == 1) & (df['X2'] == 0)]

At this step I am reducing each dataframe into a single row. I have a few ideas for that; a straightforward way is to take the mean of each dataframe. The problem is that some of the columns are categorical and some are continuous, so to take the mean of the entire dataframe I am converting the categorical columns into frequency-count columns:

cat_columns = ['X0', 'X2', 'X3', 'X4', 'X6', 'X7']

def add_freq(df, cat_columns):
    # Replace each categorical column with a frequency-count column.
    df = df.copy()
    for col in cat_columns:
        df[f'{col}_freq'] = df[col].map(df[col].value_counts())
        df = df.drop(col, axis=1)
    return df

all_groups = [df1, df2, df3, df4, df5,
              df6, df7, df8, df9, df10]

all_freq_df = [add_freq(group, cat_columns) for group in all_groups]

which looks like this for each df group:

[screenshot: a group dataframe with the categorical columns replaced by *_freq columns]

Now I can take the mean to reduce each dataframe to a single row:

all_mean_df = [pd.DataFrame(group.mean()).T for group in all_freq_df]

Concatenating all reduced rows into a single dataframe:

all_groups = pd.concat(all_mean_df).reset_index(drop=True)

The final reduced dataframe looks like this, where each row represents one reduced group:

[screenshot: the reduced dataframe, one row per group]

I want to find the outliers in this reduced dataset. I tried using the z-score, such as:

import numpy as np
from scipy import stats

all_groups[(np.abs(stats.zscore(all_groups)) < 3).all(axis=1)]

But it doesn't seem to work. I feel like there has to be a way to do this without too much complexity, but I've been stuck on how to proceed.

How can I reduce the groups into single rows and find the outliers in the reduced dataset?

Upvotes: 7

Views: 1322

Answers (3)

MAFiA303

Reputation: 1327

I used groupby with mean to reduce the rows:

groups = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10]

dfx = pd.DataFrame()
for idx, group in enumerate(groups):
    group = group.groupby(cat_columns).mean().reset_index()
    group['group'] = 'df' + str(idx + 1).zfill(2)
    dfx = pd.concat([dfx, group])

[screenshot: dfx after the groupby-mean loop]

I imagined the binary columns as inputs and the decimal columns as outputs, so I pivoted to make them into one row per group.

Making a new column named input:

dfx['input'] = dfx[cat_columns].astype(str).apply(lambda x: ''.join(x), axis=1)

[screenshot: dfx with the new input column]

Pivoting the needed columns and adjusting the names:

dfx.drop(cat_columns, axis=1, inplace=True)
dfxp = dfx.pivot(index='group', columns='input', values=['X1', 'X5'])
dfxp = dfxp.reset_index().fillna(0)
dfxp.columns = ['_'.join(x) for x in dfxp.columns]

[screenshot: the pivoted dataframe, one row per group]

Reducing the columns with PCA:

from sklearn.decomposition import PCA
import plotly.express as px

dft = PCA(n_components=2, svd_solver='full').fit_transform(dfxp.drop(columns='group_'))
dft = pd.DataFrame(dft, columns=['x', 'y'], index=dfxp.group_).reset_index()
px.scatter(dft, x='x', y='y', color='group_', title='PCA', height=600, width=600)

[scatter plot of the two PCA components, one point per group]

df7 seems to be the outlier.
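As a quick numeric check (a sketch of my own, assuming the dft frame from the PCA step above, not part of the original answer), you can rank the groups by their distance from the centroid of the projection; the farthest point is the outlier candidate:

import numpy as np

# distance of each group from the centroid of the 2-D PCA projection
center = dft[['x', 'y']].mean()
dft['dist'] = np.sqrt(((dft[['x', 'y']] - center) ** 2).sum(axis=1))
print(dft.sort_values('dist', ascending=False)[['group_', 'dist']])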

Upvotes: 3

JeeyCi

Reputation: 597

In practice, for a Gaussian distribution, data points that fall outside mean ± 3*std are considered outliers (they lie outside the 99.7% range of the distribution). Thus, for Gaussian and Gaussian-like distributions I would rewrite the previous answer as:

import numpy as np

data = df[col].values
_mean, _std = np.mean(data), np.std(data)
border = _std * 3
lower, upper = _mean - border, _mean + border
# outliers: values outside [lower, upper]
outliers = [x for x in data if x < lower or x > upper]

with data taken in a for-loop for each column range that you need. P.S. Sometimes even 4 standard deviations can be used (covering 99.9% of the distribution), but 1 standard deviation covers only 68%, and 2 cover 95%. So be sure what you really need.
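Put into that loop (a minimal sketch; the helper name three_sigma_outliers and running it on the reduced all_groups frame from the question are my assumptions, not part of the original answer):

import numpy as np

def three_sigma_outliers(frame, columns):
    # return, per column, the values outside mean +/- 3*std
    found = {}
    for col in columns:
        data = frame[col].values
        _mean, _std = np.mean(data), np.std(data)
        lower, upper = _mean - 3 * _std, _mean + 3 * _std
        found[col] = [x for x in data if x < lower or x > upper]
    return found

# e.g. on the continuous columns of the reduced dataframe:
print(three_sigma_outliers(all_groups, ['X1', 'X5']))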

Upvotes: 0

arlaine

Reputation: 311

Get the mean and std: we need to loop over each column, get its mean and std, then set the max and min values we accept for that column.

import numpy as np

# Storing the mean and std for every column as a tuple: index 0 for the
# max value, and index 1 for the min value
outliers = []
for col in df.columns:
    mean = np.mean(df[col].values)
    std = np.std(df[col].values)
    # You can play with the max and min below!
    outliers.append((mean + std, mean - std))
# Then you have the list of tuples, with each tuple representing the max and
# min values you accept for the column (index related).
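To actually flag rows with those bounds (a sketch of my own on top of this answer; the mask logic is not part of the original), mark a row as an outlier if any of its values falls outside its column's (max, min) tuple:

import pandas as pd

# build a boolean mask: True where a row breaks at least one column's bounds
mask = pd.Series(False, index=df.index)
for col, (upper, lower) in zip(df.columns, outliers):
    mask |= (df[col] > upper) | (df[col] < lower)

outlier_rows = df[mask]
print(outlier_rows)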

Upvotes: 1
