Reputation: 267
I want to do the following to my dataframe:
xlsx
fileSee below for sample data
np.random.seed([5, 1591])
df = pd.DataFrame(
np.random.normal(size=(16,5)),
columns=list('ABCDE')
)
df
A B C D E
0 -1.685112 -0.432143 0.876200 1.626578 1.512677
1 0.401134 0.439393 1.027222 0.036267 -0.655949
2 -0.074890 0.312793 -0.236165 0.660909 0.074468
3 0.842169 2.759467 0.223652 0.432631 -0.484871
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380
5 0.083653 0.792835 -0.643204 1.182606 -1.207692
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188
8 2.354769 1.099483 -0.653342 -0.532208 0.269307
9 0.431649 0.666982 0.361765 0.419482 0.531072
10 -0.124268 -0.170720 -0.979012 -0.410861 1.000371
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283
14 0.029966 -0.579152 0.648176 0.833141 -0.942752
15 0.824767 0.974580 0.363170 0.428062 -0.232174
The desired outcome should look something like this:
## I want to ONLY identify the outliers NOT remove or substitute them. I only used NaN to depict the outlier value. Ideally, the outlier values cell should be colored/highlighted 'red'.
## Please note: the outliers NaN in the sample are randomly assigned.
A B C D E Anomaly_Count
0 NaN -0.432143 0.876200 NaN 1.512677 2
1 0.401134 0.439393 1.027222 0.036267 -0.655949 0
2 -0.074890 0.312793 -0.236165 0.660909 0.074468 0
3 0.842169 NaN 0.223652 0.432631 -0.484871 1
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380 0
5 0.083653 0.792835 -0.643204 NaN NaN 2
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728 0
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188 0
8 2.354769 1.099483 -0.653342 -0.532208 0.269307 0
9 0.431649 0.666982 0.361765 0.419482 0.531072 0
10 -0.124268 -0.170720 -0.979012 -0.410861 NaN 1
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289 0
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504 0
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283 0
14 0.029966 -0.579152 0.648176 0.833141 -0.942752 0
15 0.824767 NaN 0.363170 0.428062 -0.232174 1
See below for my attempt, I am open to other approaches
import numpy as np
from scipy import stats
def outlier_detection (data):
# step I: identify the outliers in each row
df[(np.abs(stats.zscore(df)) < 3).all(axis = 0)] # unfortunately this removes the outliers which I dont want
# step II: color/highlight the outlier cell
df = df.style.highlight_null('red')
# Step III: count the number of outliers in each row
df['Anomaly_count'] = df.isnull().sum(axis=1)
# step IV: export as xlsx file
df.to_excel(r'Path to store the exported excel file\File Name.xlsx', sheet_name='Your sheet name', index = False)
outlier_detection(df)
Thanks for your time.
Upvotes: 2
Views: 3151
Reputation: 1238
This works for me
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed([5, 1591])
df = pd.DataFrame(
np.random.normal(size=(16, 5)),
columns=list('ABCDE')
)
mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
df["Count"] = mask.sum(axis=1)
mask["Count"] = False
style_df = mask.applymap(lambda x: "background-color: red" if x else "")
sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
df.style.apply(lambda x: style_df, axis=None).to_excel(writer,
sheet_name=sheet_name,
index=False)
Here the mask is the boolean conditional where we have true if zscore exceeds the limit. Based on this boolean mask I create a string dataframe 'style_df' with the values 'background: red' on the deviating cells. The values of the style_df is imposed with the last statement on the style of the df dataframe.
The resulting excel file looks now like this
Upvotes: 2